Re: [HACKERS] multivariate statistics / patch v7

2015-09-24 Thread Josh Berkus
Tomas,

> attached is v7 of the multivariate stats patch. The main improvement is
> major refactoring of the clausesel.c portion - splitting the awfully
> long spaghetti-style functions into smaller pieces, making it much more
> understandable etc.

So presumably v7 handles varlena attributes as well, yes?   I have a
destruction test case for correlated column stats, so I'd like to test
your patch on it.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] multivariate statistics / patch v7

2015-09-24 Thread Tomas Vondra

Hi,

On 09/24/2015 06:43 PM, Josh Berkus wrote:

Tomas,


attached is v7 of the multivariate stats patch. The main improvement is
major refactoring of the clausesel.c portion - splitting the awfully
long spaghetti-style functions into smaller pieces, making it much more
understandable etc.


So presumably v7 handles varlena attributes as well, yes?   I have a
destruction test case for correlated column stats, so I'd like to test
your patch on it.


Yes, it should handle varlena OK. Let me know if you need help with 
that, and I'd like to hear feedback - whether it fixed your test case or 
not, etc.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] multivariate statistics / patch v7

2015-08-25 Thread Michael Paquier
On Fri, Jul 31, 2015 at 6:28 AM, Tomas Vondra
tomas.von...@2ndquadrant.com wrote:
 [series of arguments]

 If you need stats without these issues you'll have to use MCV list or a
 histogram. Trying to fix the simple statistics types is futile, IMHO.

Patch is marked as returned with feedback. There have been advanced
discussions and reviews as well.
-- 
Michael




Re: [HACKERS] multivariate statistics / patch v7

2015-07-30 Thread Tomas Vondra

Hi,

On 07/30/2015 10:21 AM, Heikki Linnakangas wrote:

On 05/25/2015 11:43 PM, Tomas Vondra wrote:

There are 6 files attached, but only 0002-0006 are actually part of the
multivariate statistics patch itself.


All of these patches are huge. In order to review this in a reasonable
amount of time, we need to do this in several steps. So let's see what
would be the minimal set of these patches that could be reviewed and
committed, while still being useful.

The main patches are:

1. shared infrastructure and functional dependencies
2. clause reduction using functional dependencies
3. multivariate MCV lists
4. multivariate histograms
5. multi-statistics estimation

Would it make sense to commit only patches 1 and 2 first? Would that be
enough to get a benefit from this?


I agree that the patch can't be reviewed as a single chunk - that was 
the idea when I split the original (single chunk) patch into multiple 
smaller pieces.


And yes, I believe committing pieces 1 and 2 might be enough to get 
something useful, which can then be improved by adding the usual MCV 
and histogram stats on top of that.



I have some doubts about the clause reduction and functional
dependencies part of this. It seems to treat functional dependency as
a boolean property, but even with the classic zipcode and city case,
it's not always an all or nothing thing. At least in some countries,
there can be zipcodes that span multiple cities. So zipcode=X does
not completely imply city=Y, although there is a strong correlation
(if that's the right term). How strong does the correlation need to
be for this patch to decide that zipcode implies city? I couldn't
actually see a clear threshold stated anywhere.

So rather than treating functional dependence as a boolean, I think
it would make more sense to put a 0.0-1.0 number to it. That means
that you can't do clause reduction like it's done in this patch,
where you actually remove clauses from the query for cost estimation
purposes. Instead, you need to calculate the selectivity for each
clause independently, but instead of just multiplying the
selectivities together, apply the dependence factor to it.

Does that make sense? I haven't really looked at the MCV, histogram
and multi-statistics estimation patches yet. Do those patches make
the clause reduction patch obsolete? Should we forget about the
clause reduction and functional dependency patch, and focus on those
later patches instead?


Perhaps. It's true that most real-world data sets are not 100% valid 
with respect to functional dependencies - either because of natural 
imperfections (multiple cities with the same ZIP code) or just noise in 
the data (incorrect entries ...). And it's even mentioned in the code 
comments somewhere, I guess.


But there are two main reasons why I chose not to extend the functional 
dependencies with the [0.0-1.0] value you propose.


Firstly, functional dependencies were meant to be the simplest possible 
implementation, illustrating how the infrastructure is supposed to 
work (which is the main topic of the first patch).


Secondly, all kinds of statistics are simplifications of the actual 
data. So I think it's not incorrect to ignore the exceptions up to some 
threshold.


I also don't think this will make the estimates globally better. Let's 
say you have 1% of rows that contradict the functional dependency - you 
may either ignore them and have good estimates for 99% of the values and 
incorrect estimates for 1%, or tweak the rule a bit and make the 
estimates worse for 99% (and possibly better for 1%).


That being said, I'm not against improving the functional dependencies. 
I already do have some improvements on my TODO - like for example 
dependencies on more columns (not just A=B but [A,B]=C and such), but 
I think we should not squash this into those two patches.


And yet another point - ISTM these cases might easily be handled better 
by the statistics based on ndistinct coefficients, as proposed by 
Kyotaro-san some time ago. That is, compute and track


ndistinct(A) * ndistinct(B) / ndistinct(A,B)

for all pairs of columns (or possibly larger groups). That seems to be 
similar to the coefficient you propose.
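The coefficient above can be sketched in a few lines. This is an illustrative Python mock-up (not the patch's C code): it computes ndistinct(A) * ndistinct(B) / ndistinct(A,B) over a sample of pairs. A value near ndistinct(B) suggests A functionally determines B; a value near 1.0 suggests independence.

```python
def ndistinct_coefficient(rows):
    """rows is a list of (a, b) pairs sampled from the table."""
    nd_a = len({a for a, _ in rows})       # ndistinct(A)
    nd_b = len({b for _, b in rows})       # ndistinct(B)
    nd_ab = len({(a, b) for a, b in rows}) # ndistinct(A,B)
    return nd_a * nd_b / nd_ab

# Perfect dependency zip -> city: every zip maps to exactly one city.
dependent = [(z, z // 10) for z in range(100)]  # 100 zips, 10 cities
print(ndistinct_coefficient(dependent))         # 100*10/100 = 10.0

# Independent columns: all combinations appear.
independent = [(a, b) for a in range(10) for b in range(10)]
print(ndistinct_coefficient(independent))       # 10*10/100 = 1.0
```

Unlike the boolean dependency check, this varies smoothly as exceptions creep into the data.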


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] multivariate statistics / patch v7

2015-07-30 Thread Tomas Vondra

Hi,

On 07/30/2015 06:58 PM, Heikki Linnakangas wrote:


The problem with a threshold is that around that threshold, even a
small change in the data set can drastically change the produced
estimates. For example, imagine that we know from the stats that zip
code implies city. But then someone adds a single row to the table
with an odd zip code & city combination, which pushes the estimator
over the threshold, and the columns are no longer considered
dependent, and the estimates are now completely different. We should
avoid steep cliffs like that.

BTW, what is the threshold in the current patch?


There's not a simple threshold - the algorithm mining the functional 
dependencies is a bit more complicated. I tried to explain it in the 
comment before build_mv_dependencies (in dependencies.c), but let me 
briefly summarize it here.


To mine dependency [A => B], build_mv_dependencies does this:

(1) sort the sample by {A,B}

(2) split the sample into groups with the same value of A

(3) for each group, decide if it's consistent with the dependency

(a) if the group is too small (less than 3 rows), ignore it

(b) if the group is consistent, update

n_supporting
n_supporting_rows

(c) if the group is inconsistent, update

n_contradicting
n_contradicting_rows

(4) decide whether the dependency is valid by checking

n_supporting_rows >= n_contradicting_rows * 10

The limit is rather arbitrary and yes - I can imagine a more complex 
condition (e.g. looking at average number of tuples per group etc.), but 
I haven't looked into that - the point was to use something very simple, 
only to illustrate the infrastructure.
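The steps above can be sketched as follows. This is an illustrative Python mock-up of the summary (the real implementation is build_mv_dependencies in dependencies.c); the group-size and 10:1 constants come from the summary, everything else is assumed for the example.

```python
from itertools import groupby

def dependency_holds(sample, min_group_size=3, ratio=10):
    """Decide whether A => B holds on a sample of (a, b) pairs."""
    # (1) sort the sample by {A,B}
    sample = sorted(sample)
    n_supporting_rows = n_contradicting_rows = 0
    # (2) split the sample into groups with the same value of A
    for _, group in groupby(sample, key=lambda row: row[0]):
        group = list(group)
        # (3a) groups smaller than 3 rows are ignored
        if len(group) < min_group_size:
            continue
        # (3b/3c) a group is consistent if it has a single B value
        if len({b for _, b in group}) == 1:
            n_supporting_rows += len(group)
        else:
            n_contradicting_rows += len(group)
    # (4) accept if supporting rows outweigh contradicting ones 10:1
    return n_supporting_rows >= n_contradicting_rows * ratio

rows = [(z, z // 10) for z in range(100)] * 3  # zip -> city, 3 rows each
print(dependency_holds(rows))                  # True
rows += [(1, 999)] * 40                        # enough noise to flip it
print(dependency_holds(rows))                  # False
```

The second call shows the "steep cliff" discussed later in the thread: a batch of contradicting rows in one group flips the decision outright.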


I think we might come up with some elaborate way of associating degree 
with the functional dependency, but at that point we really lose the 
simplicity, and also make it indistinguishable from the remaining 
statistics (because it won't be possible to reduce the clauses like 
this, before performing the regular estimation). Which is exactly what 
makes the functional dependencies so neat and efficient, so I'm not 
overly enthusiastic about doing that.


What seems more interesting is implementing the ndistinct coefficient 
instead, as proposed by Kyotaro-san - that seems to have the nice 
smooth behavior you desire, while keeping the simplicity.


Both statistics types (functional dependencies and ndistinct coeff) have 
one weak point, though - they somehow assume the queries use 
compatible values. For example if you use a query with


   WHERE city = 'New York' AND zip = 'zip for Detroit'

they can't detect cases like this, because those statistics types are 
oblivious to individual values. I don't see this as a fatal flaw, though 
- it's rather a consequence of the nature of the stats. And I tend to 
look at the functional dependencies the same way.


If you need stats without these issues you'll have to use MCV list or 
a histogram. Trying to fix the simple statistics types is futile, IMHO.


regards
Tomas

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] multivariate statistics / patch v7

2015-07-30 Thread Heikki Linnakangas

On 05/25/2015 11:43 PM, Tomas Vondra wrote:

There are 6 files attached, but only 0002-0006 are actually part of the
multivariate statistics patch itself.


All of these patches are huge. In order to review this in a reasonable 
amount of time, we need to do this in several steps. So let's see what 
would be the minimal set of these patches that could be reviewed and 
committed, while still being useful.


The main patches are:

1. shared infrastructure and functional dependencies
2. clause reduction using functional dependencies
3. multivariate MCV lists
4. multivariate histograms
5. multi-statistics estimation

Would it make sense to commit only patches 1 and 2 first? Would that be 
enough to get a benefit from this?


I have some doubts about the clause reduction and functional 
dependencies part of this. It seems to treat functional dependency as a 
boolean property, but even with the classic zipcode and city case, it's 
not always an all or nothing thing. At least in some countries, there 
can be zipcodes that span multiple cities. So zipcode=X does not 
completely imply city=Y, although there is a strong correlation (if 
that's the right term). How strong does the correlation need to be for 
this patch to decide that zipcode implies city? I couldn't actually see 
a clear threshold stated anywhere.


So rather than treating functional dependence as a boolean, I think it 
would make more sense to put a 0.0-1.0 number to it. That means that you 
can't do clause reduction like it's done in this patch, where you 
actually remove clauses from the query for cost estimation purposes. 
Instead, you need to calculate the selectivity for each clause 
independently, but instead of just multiplying the selectivities 
together, apply the dependence factor to it.
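One way to read this proposal is as an interpolation between full independence and full dependence, driven by a degree in [0.0, 1.0]. The following is a sketch of that reading, not code from the patch; the function name and the exact interpolation formula are assumptions for illustration.

```python
def combined_selectivity(sel_a, sel_b, degree):
    """degree=0.0 -> plain multiplication (independent columns),
    degree=1.0 -> the weaker clause is implied by the stronger one."""
    independent = sel_a * sel_b      # standard independence assumption
    dependent = min(sel_a, sel_b)    # full implication: one clause is free
    return independent + degree * (dependent - independent)

# zipcode=X (sel 0.001) and city=Y (sel 0.01), 95% functional dependency:
print(combined_selectivity(0.001, 0.01, 0.95))
# independence alone would give 0.00001; near-full dependency gives ~0.00095
```

Because the degree enters linearly, a single odd row only nudges the estimate instead of flipping it, which avoids the threshold cliff discussed below.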


Does that make sense? I haven't really looked at the MCV, histogram and 
multi-statistics estimation patches yet. Do those patches make the 
clause reduction patch obsolete? Should we forget about the clause 
reduction and functional dependency patch, and focus on those later 
patches instead?


- Heikki





Re: [HACKERS] multivariate statistics / patch v7

2015-07-30 Thread Heikki Linnakangas

On 07/30/2015 03:55 PM, Tomas Vondra wrote:

On 07/30/2015 10:21 AM, Heikki Linnakangas wrote:

I have some doubts about the clause reduction and functional
dependencies part of this. It seems to treat functional dependency as
a boolean property, but even with the classic zipcode and city case,
it's not always an all or nothing thing. At least in some countries,
there can be zipcodes that span multiple cities. So zipcode=X does
not completely imply city=Y, although there is a strong correlation
(if that's the right term). How strong does the correlation need to
be for this patch to decide that zipcode implies city? I couldn't
actually see a clear threshold stated anywhere.

So rather than treating functional dependence as a boolean, I think
it would make more sense to put a 0.0-1.0 number to it. That means
that you can't do clause reduction like it's done in this patch,
where you actually remove clauses from the query for cost estimation
purposes. Instead, you need to calculate the selectivity for each
clause independently, but instead of just multiplying the
selectivities together, apply the dependence factor to it.

Does that make sense? I haven't really looked at the MCV, histogram
and multi-statistics estimation patches yet. Do those patches make
the clause reduction patch obsolete? Should we forget about the
clause reduction and functional dependency patch, and focus on those
later patches instead?


Perhaps. It's true that most real-world data sets are not 100% valid
with respect to functional dependencies - either because of natural
imperfections (multiple cities with the same ZIP code) or just noise in
the data (incorrect entries ...). And it's even mentioned in the code
comments somewhere, I guess.

But there are two main reasons why I chose not to extend the functional
dependencies with the [0.0-1.0] value you propose.

Firstly, functional dependencies were meant to be the simplest possible
implementation, illustrating how the infrastructure is supposed to
work (which is the main topic of the first patch).

Secondly, all kinds of statistics are simplifications of the actual
data. So I think it's not incorrect to ignore the exceptions up to some
threshold.


The problem with a threshold is that around that threshold, even a small 
change in the data set can drastically change the produced estimates. 
For example, imagine that we know from the stats that zip code implies 
city. But then someone adds a single row to the table with an odd zip 
code & city combination, which pushes the estimator over the threshold, 
and the columns are no longer considered dependent, and the estimates 
are now completely different. We should avoid steep cliffs like that.


BTW, what is the threshold in the current patch?

- Heikki




Re: [HACKERS] multivariate statistics / patch v7

2015-07-27 Thread Tomas Vondra

Hello Horiguchi-san,

On 07/27/2015 09:04 AM, Kyotaro HORIGUCHI wrote:

Hello,

At Sat, 25 Jul 2015 23:09:31 +0200, Tomas Vondra tomas.von...@2ndquadrant.com wrote 
in 55b3fb0b.7000...@2ndquadrant.com

Hi,

On 07/16/2015 01:51 PM, Kyotaro HORIGUCHI wrote:

Hi, I'd like to show you the modified constitution of
multivariate statistics application logic. Please find the
attached. They apply on your v7 patch.


Sadly I do have some trouble getting it to apply correctly :-(
So for now all my comments are based on just reading the code.


Ah. My modification to rebase to master for the time being must
be the culprit. Sorry for the dirty patch.

# I would recreate the patch if you complained before struggling
# with the thing..

The core of the modification is in clausesel.c. I attached the
patched clausesel.c.


I don't see any attachment. Perhaps you forgot to actually attach it?



My concerns about the code at the time were the following:

- You embedded the logic of multivariate estimation into
   clauselist_selectivity. I think estimation using multivariate
   statistics is quite different from the ordinary estimate based
   on single-column stats, so they are logically separable and
   we should separate them.


I don't see them as very different, actually quite the opposite. The two 
kinds of statistics are complementary and should naturally coexist. 
Perhaps the current code is not perfect and a refactoring would make the 
code more readable, but I don't think it's primary aim should be to 
separate regular and multivariate stats.




- You are taking a top-down approach and it runs tree-walking to
   check applicability of mv-stats at every step down the
   clause tree. If a subtree is found to be mv-applicable, it is split
   into two parts - mv-compatible and non-compatible. These steps
   require expression tree walking, which looks like using too much
   CPU.


I'm taking top-down approach because that's what the regular stats do, 
and also because that's what allows implementing the features that I 
think are interesting - ability to combine multiple stats in an 
efficient way, pass conditions and such. I think those two features are 
very useful and allow very interesting things.


The bottom-up would work too, probably - I mean, we could start from 
leaves of the expression tree, and build the largest subtree 
compatible with multivariate stats and then try to estimate it. I don't 
see how we could pass conditions though, which works naturally in the 
top-down approach.


Or maybe a combination of both - identify the compatible subtrees 
first, then perform the top-down phase.



- You look to be considering the cases when users create many
   multivariate statistics on attribute sets having
   duplications. But that looks like too much to me. MV-stats are more
   resource-hungry, so we can assume minimal usage of them.


Not really. I don't expect huge numbers of multivariate stats to be 
built on the tables.


But I think restricting the users to use a single multivariate 
statistics per table would be a significant limitation. And once you 
allow using multiple multivariate statistics for the set of clauses, 
supporting overlapping stats is not that difficult.


What it however makes possible is combining multiple small stats into 
a larger one in a very efficient way - it assumes the overlap is 
sufficient, of course. But if that's true you may build multiple small 
(and very accurate) stats instead of one huge (or very inaccurate) 
statistics.


This also makes it possible to handle complex combinations of clauses 
that are compatible and incompatible with multivariate statistics, by 
passing the conditions.




My suggestion in the patch is a bottom-up approach to find
mv-applicable portion(s) in the expression tree, which is the
basic way of planner overall. The approach requires no repetitive
run of tree walker, that is, pull_varnos. It could fail to find
the 'optimal' solution for complex situations but needs far less
calculation for almost the same return (I think..).

Even though it doesn't consider the functional dependency, the
reduction of the code shows the efficiency. It does nothing
tricky.


OK


On a conceptual level, I think the idea to split the estimation into
two phases - enrich the expression tree with nodes with details about
stats etc, and then actually do the estimation in the second phase
might be interesting. Not because it's somehow clearer, but because it
gives us a chance to see the expression tree as a whole, with details
about all the stats (with the current code we process/estimate the
tree incrementally). But I don't really know how useful that would be.


It is difficult to say which approach is better since it is
affected by what we consider important. However, I am
concerned that your code substantially reconstructs the
expression (clause) tree in the midst of processing it. I believe it
should be a separate phase for simplicity. Of course the additionally
required resources should also be considered, but they are rather
reduced in this case.

Re: [HACKERS] multivariate statistics / patch v7

2015-07-27 Thread Kyotaro HORIGUCHI
Hello,

At Sat, 25 Jul 2015 23:09:31 +0200, Tomas Vondra tomas.von...@2ndquadrant.com 
wrote in 55b3fb0b.7000...@2ndquadrant.com
 Hi,
 
 On 07/16/2015 01:51 PM, Kyotaro HORIGUCHI wrote:
  Hi, I'd like to show you the modified constitution of
  multivariate statistics application logic. Please find the
  attached. They apply on your v7 patch.
 
 Sadly I do have some trouble getting it to apply correctly :-(
 So for now all my comments are based on just reading the code.

Ah. My modification to rebase to master for the time being must
be the culprit. Sorry for the dirty patch. 

# I would recreate the patch if you complained before struggling
# with the thing..

The core of the modification is in clausesel.c. I attached the
patched clausesel.c.

 FWIW I've rebased my patch to the current master, it's available on
 github as usual:
 
 https://github.com/tvondra/postgres/commits/mvstats

Thanks.

  The code to find mv-applicable clause is moved out of the main
  flow of clauselist_selectivity. As I said in the previous mail,
  the new function transformRestrictInfoForEstimate (too bad name
  but just for PoC:) scans clauselist and generates
  RestrictStatsData struct which drives mv-aware selectivity
  calculation. This struct isolates MV and non-MV estimation.
 
  The struct RestrictStatData mainly consists of the following
  three parts,
 
 - clause to be estimated by current logic (MV is not applicable)
 - clause to be estimated by MV-statistics.
 - list of child RestrictStatDatas, which are to be run
   recursively.
 
  mvclause_selectivty() is the topmost function where mv stats
  works. This structure effectively prevents main estimation flow
 from being broken by modifying the mvstats part. Although I haven't
 measured it, I'm positive the code is far smaller than yours.
 
  I attached two patches to this message. The first one is to
  rebase v7 patch to current(maybe) master and the second applies
  the refactoring.
 
  I'm a little anxious about performance but I think this makes the
  process to apply mv-stats far clearer. Regtests for mvstats
 succeeded as-is except for fdep, which is not implemented in this
  patch.
 
  What do you think about this?
 
 I'm not sure, at this point. I'm having a hard time understanding how
 exactly the code works - there are pretty much no comments explaining
 the implementation, so it takes time to understand the code. This is
 especially true about transformRestrictInfoForEstimate which is also
 quite long. I understand it's a PoC, but comments would really help.

The patch itself should be hardly readable because it's not from
master but from your last patch plus something.

My concerns about the code at the time were the following:

- You embedded the logic of multivariate estimation into
  clauselist_selectivity. I think estimation using multivariate
  statistics is quite different from the ordinary estimate based
  on single-column stats, so they are logically separable and
  we should separate them.

- You are taking a top-down approach and it runs tree-walking to
  check applicability of mv-stats at every step down the
  clause tree. If a subtree is found to be mv-applicable, it is split
  into two parts - mv-compatible and non-compatible. These steps
  require expression tree walking, which looks like using too much
  CPU.

- You look to be considering the cases when users create many
  multivariate statistics on attribute sets having
  duplications. But that looks like too much to me. MV-stats are more
  resource-hungry, so we can assume minimal usage of them.

My suggestion in the patch is a bottom-up approach to find
mv-applicable portion(s) in the expression tree, which is the
basic way of planner overall. The approach requires no repetitive
run of tree walker, that is, pull_varnos. It could fail to find
the 'optimal' solution for complex situations but needs far less
calculation for almost the same return (I think..).

Even though it doesn't consider the functional dependency, the
reduction of the code shows the efficiency. It does nothing
tricky.

 On a conceptual level, I think the idea to split the estimation into
 two phases - enrich the expression tree with nodes with details about
 stats etc, and then actually do the estimation in the second phase
 might be interesting. Not because it's somehow clearer, but because it
 gives us a chance to see the expression tree as a whole, with details
 about all the stats (with the current code we process/estimate the
 tree incrementally). But I don't really know how useful that would be.

It is difficult to say which approach is better since it is
affected by what we consider important. However, I am
concerned that your code substantially reconstructs the
expression (clause) tree in the midst of processing it. I believe it
should be a separate phase for simplicity. Of course the additionally
required resources should also be considered, but they are rather
reduced in this case.

 I don't think the proposed change makes the process somehow clearer. 

Re: [HACKERS] multivariate statistics / patch v7

2015-07-25 Thread Tomas Vondra

Hi,

On 07/16/2015 01:51 PM, Kyotaro HORIGUCHI wrote:

Hi, I'd like to show you the modified constitution of
multivariate statistics application logic. Please find the
attached. They apply on your v7 patch.


Sadly I do have some trouble getting it to apply correctly :-(
So for now all my comments are based on just reading the code.

FWIW I've rebased my patch to the current master, it's available on 
github as usual:


https://github.com/tvondra/postgres/commits/mvstats


The code to find mv-applicable clause is moved out of the main
flow of clauselist_selectivity. As I said in the previous mail,
the new function transformRestrictInfoForEstimate (too bad name
but just for PoC:) scans clauselist and generates
RestrictStatsData struct which drives mv-aware selectivity
calculation. This struct isolates MV and non-MV estimation.

The struct RestrictStatData mainly consists of the following
three parts,

   - clause to be estimated by current logic (MV is not applicable)
   - clause to be estimated by MV-statistics.
   - list of child RestrictStatDatas, which are to be run
 recursively.

mvclause_selectivty() is the topmost function where mv stats
works. This structure effectively prevents main estimation flow
from being broken by modifying the mvstats part. Although I haven't
measured it, I'm positive the code is far smaller than yours.

I attached two patches to this message. The first one is to
rebase v7 patch to current(maybe) master and the second applies
the refactoring.

I'm a little anxious about performance but I think this makes the
process to apply mv-stats far clearer. Regtests for mvstats
succeeded as-is except for fdep, which is not implemented in this
patch.

What do you think about this?


I'm not sure, at this point. I'm having a hard time understanding how 
exactly the code works - there are pretty much no comments explaining 
the implementation, so it takes time to understand the code. This is 
especially true about transformRestrictInfoForEstimate which is also 
quite long. I understand it's a PoC, but comments would really help.


On a conceptual level, I think the idea to split the estimation into two 
phases - enrich the expression tree with nodes with details about stats 
etc, and then actually do the estimation in the second phase might be 
interesting. Not because it's somehow clearer, but because it gives us a 
chance to see the expression tree as a whole, with details about all the 
stats (with the current code we process/estimate the tree 
incrementally). But I don't really know how useful that would be.


I don't think the proposed change makes the process somehow clearer. I 
know it's a PoC at this point, so I don't expect it to be perfect, but 
for me the original code is certainly clearer. Of course, I'm biased as 
I wrote the current code, and I (naturally) shaped it to match my ideas 
during the development process, and I'm much more familiar with it.


Omitting the support for functional dependencies is a bit unfortunate, I 
think. Is that merely to make the PoC simpler, or is there something 
that makes it impossible to support that kind of stats?


Another thing that I noticed is that you completely removed the code 
that combined multiple stats (and selected the best combination of 
stats). In other words, you've reverted to the intermediate single 
statistics approach, including removing the improved handling of OR 
clauses and conditions. It's a bit difficult to judge the proposed 
approach not knowing how well it supports those (quite crucial) 
features. What if it can't support some of them, or what if it makes the 
code much more complicated (thus defeating the goal of making it more 
clear)?


I share your concern about the performance impact - one thing is that 
this new code might be slower than the original one, but a more serious 
issue IMHO is that the performance impact will happen even for relations 
with no multivariate stats at all. The original patch was very careful 
about getting ~0% overhead in such cases, and if the new code does not 
allow that, I don't see this approach as acceptable. We must not put 
additional overhead on people not using multivariate stats.


But I think it's worth exploring this idea a bit more - can you rebase 
it to the current patch version (as on github) and adding the missing 
pieces (functional dependencies, multi-statistics estimation and passing 
conditions)?


One more thing - I noticed you extended the pg_operator catalog with a 
oprmvstat attribute, used to flag operators that are compatible with 
multivariate stats. I'm not happy with the current approach (using 
oprrest to do this decision), but I'm not really sure this is a good 
solution either. The culprit is that it only answers one of the two 
important questions - Is it compatible? How to perform the estimation?


So we'd have to rely on oprrest anyway, when actually performing the 
estimation of a clause with compatible operator. And we'd have to keep 
in 

Re: [HACKERS] multivariate statistics / patch v7

2015-07-14 Thread Tomas Vondra

Hi,

On 07/13/2015 10:51 AM, Kyotaro HORIGUCHI wrote:


Ok, I understood the difference between what I thought and what you
say. The code is actually conscious of OR clauses but it looks somewhat
confused.


I'm not sure which part is confused by the OR clauses, but it's 
certainly possible. Initially it only handled AND clauses, and the 
support for OR clauses was added later, so it's possible some parts are 
not behaving correctly.




Currently choosing mv stats in clauselist_selectivity can be
outlined as following,

1. find_stats finds candidate mv stats containing *all*
attributes that appear in the whole clause, regardless of and/or
exprs, by walking the whole clause tree.

Perhaps this is a measure for early bailout.


Not entirely. The goal of find_stats() is to lookup all stats on the 
'current' relation - it's coded the way it is because I had to deal with 
varRelid=0 cases, in which case I have to inspect the Var nodes. But 
maybe I got this wrong and there's a much simpler way to do that?


It is an early bailout in the sense that if there are no multivariate 
stats defined on the table, there's no point in doing any of the 
following steps. So that we don't increase planning times for users not 
using multivariate stats.



2.1. Within every disjunction element, collect mv-related
attributes while checking whether all the leaf nodes (binop or
ifnull) are compatible, by (eventually) walking the whole
clause tree.


Generally, yes. The idea is to check whether there are clauses that 
might be estimated using multivariate statistics, and whether the 
clauses reference at least two different attributes. Imagine a query 
like this:


   SELECT * FROM t WHERE (a=1) AND (a>0) AND (a<100)

It makes no sense to process this using multivariate statistics, because 
all the Var nodes reference a single attribute.


Similarly, the check is not just about the leaf nodes - to be able to 
estimate a clause at this point, we have to be able to process the whole 
tree, starting from the top-level clause. Although maybe that's no 
longer true, now that support for OR clauses was added ... I wonder 
whether there are other BoolExpr-like nodes, that might make the tree 
incompatible with multivariate statistics (in the sense that the current 
implementation does not know how to handle them).


Also note that even though the clause may be incompatible at this 
level, it may get partially processed by multivariate statistics later. 
For example with a query:


   SELECT * FROM t WHERE (a=1 OR b=2 OR c ~* 'xyz') AND (q=1 OR r=4)

the first clause is incompatible because it contains the unsupported 
operator '~*', but it will eventually be processed as a BoolExpr node, 
and should be split into two parts - (a=1 OR b=2), which is compatible, 
and (c ~* 'xyz'), which is incompatible.


This split should happen in clauselist_selectivity_or(), and the other 
thing that may be interesting is that it uses (q=1 OR r=4) as a 
condition. So if there are statistics built on (a,b,q,r), we'll compute 
the conditional probability


P(a=1,b=2 | q=1,r=4)

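Spelled out, that is just the product rule for probabilities - the selectivity of the whole WHERE clause follows by multiplying the conditional estimate by the (separately estimated) probability of the condition:

```latex
P(a=1 \wedge b=2 \wedge q=1 \wedge r=4)
    = P(a=1, b=2 \mid q=1, r=4) \cdot P(q=1, r=4)
```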

 2.2. Check if all the collected attributes are contained in
 mv-stats columns.

No, I think you got this wrong. We do not check that *all* the 
attributes are contained in mvstats columns - we only need two such 
columns (then there's a chance that the multivariate statistics will get 
applied).


Anyway, both 2.1 and 2.2 are meant as a quick bailout, before doing the 
most expensive part, which is choose_mv_statistics(). Which is however 
missing in this list.



3. Finally, clauseset_mv_selectivity_histogram() (and others).

This function applies every ExprOp onto every attribute in
every histogram bucket and (tries to) make the boolean
operation of the result bitmaps.


Yes, but this only happens after choose_mv_statistics(), because that's 
the code that decides which statistics will be used and in what order.


The list is also missing handling of the 'functional dependencies', so a 
complete list of steps would look like this:


1) find_stats - lookup stats on the current relation (from RelOptInfo)

2) apply functional dependencies

   a) check if there are equality clauses that may be reduced using
  functional dependencies, referencing at least two columns

   b) if yes, perform the clause reduction

3) apply MCV lists and histograms

   a) check if there are clauses 'compatible' with those types of
  statistics, again containing at least two columns

   b) if yes, use choose_mv_statistics() to decide which statistics to
 apply and in which order

   c) apply the selected histograms and MCV lists

4) estimate the remaining clauses using the regular statistics

   a) this is where the clauselist_mv_selectivity_histogram and other
  are called

I tried to explain this in the comment before clauselist_selectivity(), 
but maybe it's not detailed enough / missing some important details.




I have some comments on the implementation and I 

Re: [HACKERS] multivariate statistics / patch v7

2015-07-13 Thread Kyotaro HORIGUCHI
Hi, thanks for the detailed explanation. I misunderstood the
code (more honestly speaking, I didn't look very closely). Then I
looked at it more closely.


At Wed, 08 Jul 2015 03:03:16 +0200, Tomas Vondra tomas.von...@2ndquadrant.com 
wrote in 559c76d4.2030...@2ndquadrant.com
 FWIW this was a stupid bug in update_match_bitmap_histogram(), which
 initially handled only AND clauses, and thus assumed the match of a
 bucket can only decrease. But for OR clauses this is exactly the
 opposite (we assume no buckets match and add buckets matching at least
 one of the clauses).
 
 With this fixed, the estimates look like this:
 

 IMHO pretty accurate estimates - no issue with OR clauses.

Ok, I understood the difference between what I thought and what
you say. The code is actually conscious of OR clauses but looks
somewhat confused.

Currently choosing mv stats in clauselist_selectivity can be
outlined as follows:

1. find_stats finds candidate mv stats containing *all*
   attributes appearing in the whole clause list, regardless of and/or
   exprs, by walking the whole clause tree.

   Perhaps this is a measure for early bailout.

2.1. Within every disjunction element, collect mv-related
   attributes while checking whether all the leaf nodes (binop or
   ifnull) are compatible, by (eventually) walking the whole
   clause tree.

2.2. Check if all the collected attributes are contained in
   mv-stats columns.

3. Finally, clauseset_mv_selectivity_histogram() (and others).

   This function applies every ExprOp onto every attribute in
   every histogram bucket and (tries to) make the boolean
   operation of the result bitmaps.

I have some comments on the implementation, and I will also try to
find solutions for them.


1. The flow above looks to be doing very similar things repeatedly.

2. I believe what the current code does can be simplified.

3. As you mentioned in comments, some additional infrastructure is
   needed.

After all, I think what we should do after this are as follows,
as the first step.

- Add the means to decide the selectivity function by something
  other than the oprrest of the ExprOp's operator. (You missed
  neqsel already.)

  I suppose one solution for this is adding an oprmvstats column
  taking 'm', 'h' and 'f' and their combinations. Or, for
  convenience, it could be a fixed-length string like this.

  oprname | oprmvstats
  =       | 'mhf'
  <>      | 'mhf'
  <       | 'mh-'
  >       | 'mh-'
  <=      | 'mh-'
  >=      | 'mh-'

  This would make the code in clause_is_mv_compatible like this.

   oprmvstats = get_mvstatsset(expr->opno); /* bitwise representation */
   if (oprmvstats & types)
   {
      *attnums = bms_add_member(*attnums, var->varattno);
      return true;
   }
   return false;
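For illustration, the proposed fixed-length string could map to a bitmask roughly like this (a Python sketch of the idea; the names MV_MCV, MV_HISTOGRAM, MV_FDEP and the encoding are hypothetical, not part of the patch):

```python
# Hypothetical oprmvstats flags: 'm' = MCV list, 'h' = histogram,
# 'f' = functional dependencies; '-' means unsupported.
MV_MCV, MV_HISTOGRAM, MV_FDEP = 1, 2, 4

def parse_oprmvstats(s):
    """Turn a fixed-length string like 'mh-' into a bitmask."""
    bits = 0
    if s[0] == 'm':
        bits |= MV_MCV
    if s[1] == 'h':
        bits |= MV_HISTOGRAM
    if s[2] == 'f':
        bits |= MV_FDEP
    return bits

def clause_is_mv_compatible(oprmvstats, types):
    # mirrors the proposed C test: if (oprmvstats & types) ...
    return bool(parse_oprmvstats(oprmvstats) & types)
```

So an equality operator flagged 'mhf' would be usable for all three statistics types, while an inequality flagged 'mh-' would be rejected when functional dependencies are requested.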

- The current design just manages to work, but it is too complicated
  and hardly has affinity with the existing estimation
  framework. I previously proposed separating the finding-stats
  phase from the calculation phase, but after looking closely at
  the patch I would like to propose instead a phase that transforms
  RestrictInfo (and finds mvstats), followed by a phase that runs
  the transformed RestrictInfo.

  I think transforming RestrictInfo makes the situation
  better. Since it needs different information, maybe it is
  better to have a new struct, say, RestrictInfoForEstimate
  (boo!). Then provide mvstatssel() for use with the new struct.
  The rough shape of the code would be like below.

  clauselist_selectivity()
  {
...
RestrictInfoForEstmate *esclause =
  transformClauseListForEstimation(root, clauses, varRelid);
...

return clause_selectivity(esclause):
  }

  clause_selectivity(RestrictInfoForEstmate *esclause)
  {
if (IsA(clause, RestrictInfo))...
if (IsA(clause, RestrictInfoForEstimate))
{
   RestrictInfoForEstimate *ecl = (RestrictInfoForEstimate *) clause;
   if (ecl->selfunc)
   {
  sx = ecl->selfunc(root, ecl);
   }
}
if (IsA(clause, Var))...
  }

  
  transformClauseListForEstimation(...)
  {
...

relid = collect_mvstats_info(root, clause, attlist);
if (!relid) return;
if (get_mvstats_hook)
 mvstats = (*get_mvstats_hook) (root, relid, attset);
else
 mvstats = find_mv_stats(root, relid, attset);
  }
  ...

 I've pushed this to github [1] but I need to do some additional
 fixes. I also had to remove some optimizations while fixing this, and
 will have to reimplement those.
 
 That's not to say that the handling of OR-clauses is perfectly
 correct. After looking at clauselist_selectivity_or(), I believe it's
 a bit broken and will need a bunch of fixes, as explained in the
 FIXMEs I pushed to github.
 
 [1] https://github.com/tvondra/postgres/tree/mvstats

I don't see whether it is doable or not, and I suppose you're
unwilling to change the big picture, so I will consider the idea
and will show you the result, if it turns out to be possible and
promising.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: [HACKERS] multivariate statistics / patch v7

2015-07-07 Thread Kyotaro HORIGUCHI
Hi, Tomas. I'll kick the gas pedal.

  Thank you, it looks clearer. I have some comments from a brief look
  at it. This patchset is relatively large, so I will comment on a
  per-notice basis, which means I'll send comments before examining
  the entire patchset. Sorry in advance for the desultory
  comments.
 
 Sure. If you run into something that's not clear enough, I'm happy to
 explain that (I tried to cover all the important details in the
 comments, but it's a large patch, indeed.)


  - Single-variate stats have a mechanism to inject arbitrary
 values as statistics, that is, get_relation_stats_hook and
 similar stuff. I want a similar mechanism for multivariate
 statistics, too.
 
 Fair point, although I'm not sure where should we place the hook, how
 exactly should it be defined and how useful that would be in the
 end. Can you give an example of how you'd use such hook?

It's my secret, but it's open :p. This is crucial for us when examining
many planner-related problems that occur at our customers in vitro.

http://pgdbmsstats.osdn.jp/pg_dbms_stats-en.html

# Mmm, this doc is a bit too old..

One of our tools does something like the following:

- Copy pg_statistic and some attributes of pg_class into some
  table. Of course this is exportable.

- For example, in examine_simple_variable, using the hook
  get_relation_stats_hook, inject the saved statistics in place
  of the real statistics.

The hook point is placed where the parameters specifying which
statistics are needed are available in compact form, and all the
hook function has to do is return the corresponding statistics
values.

So the parallel stuff for this mv stats will look like this.

MVStatisticInfo *
get_mv_statistics(PlannerInfo *root, relid);

or 

MVStatisticInfo *
get_mv_statistics(PlannerInfo *root, relid, bitmap or list of attnos);

So by simply applying this, the current clauselist_selectivity
code will turn into the following.

 if (list_length(clauses) == 1)
return clause_selectivity();
 
 Index varrelid = find_singleton_relid(root, clauses, varRelid);
 
 if (varrelid)
 {
 // Bitmapset attnums = collect_attnums(root, clauses, varrelid);
   if (get_mv_statistics_hook)
 stats = get_mv_statistics_hook(root, varrelid /*, attnums */);
   else
 stats = get_mv_statistics(root, varrelid /*, attnums */);
 
   

In comparison to single-column statistics, it might be preferable to
separate the statistics values from their definition.

 I've never used get_relation_stats_hook, but if I get it right, the
 plugins can use the hook to create the stats (for each column), either
 from scratch or tweaking the existing stats.

Mostly existing stats without change. I have seen a few hackers who
wanted to provide predefined statistics for typical cases. I haven't
seen anyone tweak existing stats.

 I'm not sure how this should work with multivariate stats, though,
 because there can be arbitrary number of stats for a column, and it
 really depends on all the clauses (so examine_variable() seems a bit
 inappropriate, as it only sees a single variable at a time).

Restriction clauses are not a problem. What is needed to replace
stats values is defining a few APIs to retrieve them, and
retrieving the stats values only in a way compatible with the
API. Substituting views for mv stats would be workable as an
extreme case, but it is not good.

 Moreover, with multivariate stats
 
(a) there may be arbitrary number of stats for a column
 
(b) only some of the stats end up being used for the estimation
 
 I see two or three possible places for calling such hook:
 
(a) at the very beginning, after fetching the list of stats
 
- sees all the existing stats on a table
- may add entirely new stats or tweak the existing ones

Getting all stats for a table would be okay, but an attnum list can
narrow the possibilities, as in the second form of the example
APIs above. And we can ignore the case of forged or tweaked
stats; that is the hook author's problem, not ours.


(b) after collecting the list of variables compatible with
multivariate stats
 
- like (a) and additionally knows which columns are interesting
  for the query (but only with respect to the existing stats)

We should carefully design the API to be able to identify the
pertinent stats in every situation. MV stats are based on the
correlation of multiple columns, so I think a relid and an
attribute list are enough as parameters.

| if (st.relid == param.relid && bms_equal(st.attnums, param.attnums))
|/* These are the stats we want  */

If we can filter the appropriate stats from all the stats using the
clause list, we can definitely build the appropriate parameter
(column set) prior to retrieving mv statistics. Isn't that correct?

(c) after optimization (selection of the right combination if stats)
 
- like (b), but can't affect the optimization
 
 But I can't really imagine anyone building multivariate stats on the
 fly, in the hook.
 
 It's more 

Re: [HACKERS] multivariate statistics / patch v7

2015-07-07 Thread Tomas Vondra

Hi,

On 07/07/2015 08:05 AM, Kyotaro HORIGUCHI wrote:

Hi, Tomas. I'll kick the gas pedal.


Thank you, it looks clearer. I have some comments from a brief look
at it. This patchset is relatively large, so I will comment on a
per-notice basis, which means I'll send comments before examining
the entire patchset. Sorry in advance for the desultory
comments.


Sure. If you run into something that's not clear enough, I'm happy to
explain that (I tried to cover all the important details in the
comments, but it's a large patch, indeed.)




- Single-variate stats have a mechanism to inject arbitrary
values as statistics, that is, get_relation_stats_hook and
similar stuff. I want a similar mechanism for multivariate
statistics, too.


Fair point, although I'm not sure where should we place the hook,
how exactly should it be defined and how useful that would be in
the end. Can you give an example of how you'd use such hook?


...



We should carefully design the API to be able to identify the
pertinent stats in every situation. MV stats are based on the
correlation of multiple columns, so I think a relid and an
attribute list are enough as parameters.

| if (st.relid == param.relid && bms_equal(st.attnums, param.attnums))
|/* These are the stats we want  */

If we can filter the appropriate stats from all the stats using the
clause list, we can definitely build the appropriate parameter (column
set) prior to retrieving mv statistics. Isn't that correct?


Let me briefly explain how the current clauselist_selectivity 
implementation works.


  (1) check if there are multivariate statistics on the table - if not,
  skip the multivariate parts altogether (the point of this is to
  minimize impact on users who don't use the new feature)

  (2) see if there are clauses compatible with multivariate stats - this
  only checks general compatibility without actually checking the
  existing stats (the point is to terminate early if the clauses
  are not compatible somehow - e.g. if the clauses reference only a
  single attribute, use unsupported operators etc.)

  (3) if there are multivariate stats and compatible clauses, the
  function choose_mv_stats tries to find the best combination of
  multivariate stats with respect to the clauses (details later)

  (4) the clauses are estimated using the stats, the remaining clauses
  are estimated using the current statistics (single attribute)

The only way to reliably inject new stats is by calling a hook before 
(1), allowing it to arbitrarily modify the list of stats. Based on the 
use cases you provided, I don't think it makes much sense to add 
additional hooks in the other phases.


At this point, however, it's not yet known which clauses are compatible 
with multivariate stats, or which attributes they reference. It might be 
possible to simply call pull_varattnos() and pass the result to the hook, 
except that does not work with RestrictInfo :-/


Or maybe we could / should not put the hook into clauselist_selectivity 
but somewhere else? Say, to get_relation_info where we actually read the 
list of stats for the relation?






0001:

- I also don't think it is the right thing for expression_tree_walker
to recognize RestrictInfo, since it is not part of an expression.


Yes. In my working git repo, I've reworked this to use the second
option, i.e. adding RestrictInfo pull_(varno|varattno)_walker:

https://github.com/tvondra/postgres/commit/2dc79b914c759d31becd8ae670b37b79663a595f

Do you think this is the correct solution? If not, how to fix it?


The reason why I think it is not appropriate is that RestrictInfo
is not part of an expression.

Increasing the selectivity of a condition via column correlation
occurs only within a set of conjunctive clauses. An OR operation
divides the sets. Is that agreeable? RestrictInfos can be nested
within each other, and we should be aware of the AND/OR operators. This
is what expression_tree_walker doesn't do.


I still don't understand why you think we need to differentiate between 
AND and OR operators. There's nothing wrong with estimating OR clauses 
using multivariate statistics.




Perhaps we should provide the dedicate function such like
find_conjunctive_attr_set which does this,


Perhaps. The reason why I added support for RestrictInfo into the 
existing walker implementations is that it seemed like the easiest way 
to fix the issue. But if there are reasons why that's incorrect, then 
inventing a new function is probably the right way.




- Check the type top expression of the clause

   - If it is a RestrictInfo, check clause_relids then check
 clause.



   - If it is a bool OR, stop to search and return empty set of
 attributes.



   - If it is a bool AND, make a further check of the components. A
 list of RestrictInfo should be treated as an AND connection.



   - If it is an operator expression, collect used relids and attrs
 by walking the expression tree.



I should be missing something 

Re: [HACKERS] multivariate statistics / patch v7

2015-07-07 Thread Tomas Vondra

Hello Horiguchi-san!

On 07/07/2015 09:43 PM, Tomas Vondra wrote:

-- histograms
ALTER TABLE t ADD STATISTICS (histogram) on (a,b,c);
ANALYZE t;

EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 and c < 0.3;
Seq Scan on t  (cost=0.00..23870.00 rows=267033 width=24)
(actual time=0.021..330.487 rows=273897 loops=1)

EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 or c > 0.3;
Seq Scan on t  (cost=0.00..23870.00 rows=14317 width=24)
(actual time=0.027..906.321 rows=966870 loops=1)

EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 or c > 0.9;
Seq Scan on t  (cost=0.00..23870.00 rows=20367 width=24)
(actual time=0.028..452.494 rows=393528 loops=1)

This seems wrong, because the estimate for the OR queries should not be
lower than the estimate for the first query (with just AND), and it
should not increase when increasing the boundary. I'd bet this is a bug
in how the inequalities are handled with histograms, or how the AND/OR
clauses are combined. I'll look into that.


FWIW this was a stupid bug in update_match_bitmap_histogram(), which 
initially handled only AND clauses, and thus assumed the match of a 
bucket can only decrease. But for OR clauses this is exactly the 
opposite (we assume no buckets match and add buckets matching at least 
one of the clauses).
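The difference between the two cases can be sketched like this - a Python illustration of the bucket-bitmap logic, with made-up buckets and clause matches, not the actual update_match_bitmap_histogram() code:

```python
# Sketch of folding per-clause bucket matches into a running bitmap.
def update_match_bitmap(matches, clause_matches, is_or):
    """For AND, start from all-True and only clear flags; for OR, start
    from all-False and only set flags. The bug was applying the AND rule
    (matches can only decrease) to OR clauses as well."""
    if is_or:
        return [m or c for m, c in zip(matches, clause_matches)]
    return [m and c for m, c in zip(matches, clause_matches)]

# four histogram buckets; which ones (possibly) match each toy clause
a_lt = [True, True, False, False]   # buckets matching a < 0.3
c_gt = [False, True, True, True]    # buckets matching c > 0.3

and_match = [True] * 4              # AND: assume all buckets match, intersect
for cl in (a_lt, c_gt):
    and_match = update_match_bitmap(and_match, cl, is_or=False)

or_match = [False] * 4              # OR: assume no buckets match, union
for cl in (a_lt, c_gt):
    or_match = update_match_bitmap(or_match, cl, is_or=True)

# the OR clause can never match fewer buckets than the AND clause
assert sum(or_match) >= sum(and_match)
```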


With this fixed, the estimates look like this:

EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 and c < 0.3;
Seq Scan on t  (cost=0.00..23870.00 rows=267033 width=24)
   (actual time=0.102..321.524 rows=273897 loops=1)

EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 or c < 0.3;
Seq Scan on t  (cost=0.00..23870.00 rows=319400 width=24)
   (actual time=0.103..386.089 rows=317533 loops=1)

EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 or c > 0.3;
Seq Scan on t  (cost=0.00..23870.00 rows=956833 width=24)
   (actual time=0.133..908.455 rows=966870 loops=1)

EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 or c > 0.9;
Seq Scan on t  (cost=0.00..23870.00 rows=393633 width=24)
   (actual time=0.105..440.607 rows=393528 loops=1)

IMHO pretty accurate estimates - no issue with OR clauses.

I've pushed this to github [1] but I need to do some additional fixes. I 
also had to remove some optimizations while fixing this, and will have 
to reimplement those.


That's not to say that the handling of OR-clauses is perfectly correct. 
After looking at clauselist_selectivity_or(), I believe it's a bit 
broken and will need a bunch of fixes, as explained in the FIXMEs I 
pushed to github.


[1] https://github.com/tvondra/postgres/tree/mvstats

kind regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] multivariate statistics / patch v7

2015-07-04 Thread Tomas Vondra

Hello Horiguchi-san!

On 07/03/2015 07:30 AM, Kyotaro HORIGUCHI wrote:

Hello, I started to work on this patch.


attached is v7 of the multivariate stats patch. The main improvement
is major refactoring of the clausesel.c portion - splitting the
awfully long spaghetti-style functions into smaller pieces, making it
much more understandable etc.


Thank you, it looks clearer. I have some comments from a brief look
at it. This patchset is relatively large, so I will comment on a
per-notice basis, which means I'll send comments before examining
the entire patchset. Sorry in advance for the desultory
comments.


Sure. If you run into something that's not clear enough, I'm happy to 
explain that (I tried to cover all the important details in the 
comments, but it's a large patch, indeed.)



===
General comments:

- You included unnecessary stuff such as regression.diffs in
   these patches.


Ah :-/ Will fix.



- Now OID 3307 is used by pg_stat_file. I moved
   pg_mv_stats_dependencies_info/show to 3311/3312.


Will fix while rebasing to current master.



- Single-variate stats have a mechanism to inject arbitrary
   values as statistics, that is, get_relation_stats_hook and
   similar stuff. I want a similar mechanism for multivariate
   statistics, too.


Fair point, although I'm not sure where should we place the hook, how 
exactly should it be defined and how useful that would be in the end. 
Can you give an example of how you'd use such hook?


I've never used get_relation_stats_hook, but if I get it right, the 
plugins can use the hook to create the stats (for each column), either 
from scratch or tweaking the existing stats.


I'm not sure how this should work with multivariate stats, though, 
because there can be arbitrary number of stats for a column, and it 
really depends on all the clauses (so examine_variable() seems a bit 
inappropriate, as it only sees a single variable at a time).


Moreover, with multivariate stats

   (a) there may be arbitrary number of stats for a column

   (b) only some of the stats end up being used for the estimation

I see two or three possible places for calling such hook:

   (a) at the very beginning, after fetching the list of stats

   - sees all the existing stats on a table
   - may add entirely new stats or tweak the existing ones

   (b) after collecting the list of variables compatible with
   multivariate stats

   - like (a) and additionally knows which columns are interesting
 for the query (but only with respect to the existing stats)

   (c) after optimization (selection of the right combination if stats)

   - like (b), but can't affect the optimization

But I can't really imagine anyone building multivariate stats on the 
fly, in the hook.


It's more complicated, though, because the query may call 
clauselist_selectivity multiple times, depending on how complex the 
WHERE clauses are.




0001:

- I also don't think it is the right thing for expression_tree_walker
   to recognize RestrictInfo, since it is not part of an expression.


Yes. In my working git repo, I've reworked this to use the second 
option, i.e. adding RestrictInfo pull_(varno|varattno)_walker:


https://github.com/tvondra/postgres/commit/2dc79b914c759d31becd8ae670b37b79663a595f

Do you think this is the correct solution? If not, how to fix it?



0003:

- In clauselist_selectivity, find_stats is uselessly called for a
   single clause. It should be called only after the clause list is
   found to consist of more than one clause.


Ok, will fix.



- The search for vars to be compared with mv-stats columns, which
   find_stats does, should stop at disjunctions. But this patch
   doesn't behave that way, which should be unwanted behavior. The
   following steps show that.


Why should it stop at disjunctions? There's nothing wrong with using 
multivariate stats to estimate OR-clauses, IMHO.





  =# CREATE TABLE t1 (a int, b int, c int);
  =# INSERT INTO t1 (SELECT a, a * 2, a * 3 FROM generate_series(0, ) a);
  =# EXPLAIN SELECT * FROM t1 WHERE a = 1 AND b = 2 OR c = 3;
   Seq Scan on t1  (cost=0.00..230.00 rows=1 width=12)
  =# ALTER TABLE t1 ADD STATISTICS (HISTOGRAM) ON (a, b, c);
  =# ANALYZE t1;
  =# EXPLAIN SELECT * FROM t1 WHERE a = 1 AND b = 2 OR c = 3;
   Seq Scan on t1  (cost=0.00..230.00 rows=268 width=12)

  Rows changed unwantedly.


That has nothing to do with OR clauses, but rather with using a type of 
statistics that does not fit the data and queries. Histograms are quite 
inaccurate for discrete data and equality conditions - in this case the 
clauses probably match one bucket, and so we use 1/2 the bucket as an 
estimate. There's nothing wrong with that.
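A toy calculation shows how badly this can go for discrete data (made-up numbers for illustration, matching neither example exactly):

```python
# Why a histogram over-estimates equality on discrete data: the equality
# clause falls into a single bucket, and since the histogram cannot tell
# how much of that bucket actually matches, roughly half the bucket's
# frequency is charged as the selectivity.
nrows = 10000
nbuckets = 10
bucket_freq = 1.0 / nbuckets         # each bucket holds ~10% of the rows

est_selectivity = 0.5 * bucket_freq  # the "half a bucket" estimate
estimated_rows = est_selectivity * nrows

true_rows = 1                        # e.g. a = 1 matches a single row
```

With these numbers the histogram charges 500 rows for a clause that really matches one row - which is exactly why MCV lists fit equality conditions on discrete data much better.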


So let's use MCV instead:

ALTER TABLE t1 ADD STATISTICS (MCV) ON (a, b, c);
ANALYZE t1;
EXPLAIN SELECT * FROM t1 WHERE a = 1 AND b = 2 OR c = 3;
 QUERY PLAN
-
 Seq Scan on t1  (cost=0.00..230.00 rows=1 

Re: [HACKERS] multivariate statistics / patch v7

2015-07-02 Thread Kyotaro HORIGUCHI
Hello, I started to work on this patch.

 attached is v7 of the multivariate stats patch. The main improvement
 is major refactoring of the clausesel.c portion - splitting the
 awfully long spaghetti-style functions into smaller pieces, making it
 much more understandable etc.

Thank you, it looks clearer. I have some comments from a brief
look at it. This patchset is relatively large, so I will comment
on a per-notice basis, which means I'll send comments before
examining the entire patchset. Sorry in advance for the
desultory comments.

===
General comments:

- You included unnecessary stuff such as regression.diffs in
  these patches.

- Now OID 3307 is used by pg_stat_file. I moved
  pg_mv_stats_dependencies_info/show to 3311/3312.

- Single-variate stats have a mechanism to inject arbitrary
  values as statistics, that is, get_relation_stats_hook and
  similar stuff. I want a similar mechanism for multivariate
  statistics, too.

0001:

- I also don't think it is the right thing for expression_tree_walker
  to recognize RestrictInfo, since it is not part of an expression.

0003:

- In clauselist_selectivity, find_stats is uselessly called for a
  single clause. It should be called only after the clause list is
  found to consist of more than one clause.

- The search for vars to be compared with mv-stats columns, which
  find_stats does, should stop at disjunctions. But this patch
  doesn't behave that way, which should be unwanted behavior. The
  following steps show that.


 =# CREATE TABLE t1 (a int, b int, c int);
 =# INSERT INTO t1 (SELECT a, a * 2, a * 3 FROM generate_series(0, ) a);
 =# EXPLAIN SELECT * FROM t1 WHERE a = 1 AND b = 2 OR c = 3;
  Seq Scan on t1  (cost=0.00..230.00 rows=1 width=12)
 =# ALTER TABLE t1 ADD STATISTICS (HISTOGRAM) ON (a, b, c);
 =# ANALYZE t1;
 =# EXPLAIN SELECT * FROM t1 WHERE a = 1 AND b = 2 OR c = 3;
  Seq Scan on t1  (cost=0.00..230.00 rows=268 width=12)

 Rows changed unwantedly.

 It seems not as simple a thing as your code assumes.

 I do assume some of those pieces are unnecessary because there already
 is a helper function with the same purpose (but I'm not aware of
 that). But IMHO this piece of code begins to look reasonable
 (especially when compared to the previous state).

Yeah, that kind of work should be done later :p. This patch is
not so invasive as to make that undoable.

 The other major improvement it review of the comments (including
 FIXMEs and TODOs), and removal of the obsolete / misplaced ones. And
 there was plenty of those ...
 
 These changes made this version ~20k smaller than v6.
 
 The patch also rebases to current master, which I assume shall be
 quite stable - so hopefully no more duplicate OIDs for a while.
 
 There are 6 files attached, but only 0002-0006 are actually part of
 the multivariate statistics patch itself. The first part makes it
 possible to use pull_varnos() with expression trees containing
 RestrictInfo nodes, but maybe this is not the right way to fix this
 (there's another thread where this was discussed).

As mentioned above, checking whether mv stats can be applied would be
a more complex matter than you are now assuming. I will also
consider that.

 Also, the regression tests testing plan choice with multivariate stats
 (e.g. that a bitmap index scan is chosen instead of index scan) fail
 from time to time. I suppose this happens because the invalidation
 after ANALYZE is not processed before executing the query, so the
 optimizer does not see the stats, or something like that.

I've seen that occur, but so far I have no idea why it happens..

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

