Re: [HACKERS] WIP: multivariate statistics / proof of concept

2014-11-13 Thread Katharina Büchse

On 13.11.2014 14:11, Tomas Vondra wrote:

On 13 November 2014, 12:31, Simon Riggs wrote:

On 12 October 2014 23:00, Tomas Vondra t...@fuzzy.cz wrote:


It however seems to be working sufficiently well at this point, enough
to get some useful feedback. So here we go.

This looks interesting and useful.

What I'd like to check before a detailed review is that this has
sufficient applicability to be useful.

My understanding is that Q9 and Q18 of TPC-H have poor plans as a
result of multi-column stats errors.

Could you look at those queries and confirm that this patch can
produce better plans for them?

Sure. I planned to do such verification/demonstration anyway, after
discussing the overall approach.

I planned to give it a try on TPC-DS, but I can start with the TPC-H
queries you propose. I'm not sure whether the poor estimates in Q9 & Q18
come from column correlation, though - if they're due to some other issue
(e.g. conditions that are difficult to estimate), this patch can't do
anything with them. But it's a good start.


If so, I will work with you to review this patch.

Thanks!


One aspect of the patch that seems to be missing is a user declaration
of correlation, just as we have for setting n_distinct. It seems an even
easier place to start would be to just let the user specify the stats
declaratively. That way we can split the patch into two parts: first,
allow multi-column stats that are user-declared; then add stats
collected by ANALYZE. The first part is possibly contentious and thus
a good initial focus. The second part will have lots of discussion, so
it's good to skip it for a first version.

I'm not a big fan of this approach, for a number of reasons.

Firstly, that only works for simple parameters that are trivial to specify
(say, Pearson's correlation coefficient), and the patch does not work with
those at all - it only works with histograms and MCV lists (and might work
with associative rules in the future). And we certainly can't ask users to
specify multivariate histograms - that's very difficult to do, and complex
stats are also more likely to become stale after new data is added
to the table.
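(Just to make concrete what "trivial to specify" means here: a parameter
like Pearson's correlation coefficient is a single scalar that can be
computed from a sample in a few lines, or typed in by a user, unlike a
multivariate histogram. A minimal sketch, not related to any code in the
patch:)

#include <math.h>
#include <stddef.h>

/*
 * Pearson's correlation coefficient for two sampled columns.
 * Returns a single scalar in [-1, 1] - simple to store or to let a
 * user declare, unlike a multivariate histogram.
 */
static double
pearson_r(const double *x, const double *y, size_t n)
{
    double  sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;

    for (size_t i = 0; i < n; i++)
    {
        sx += x[i];
        sy += y[i];
        sxx += x[i] * x[i];
        syy += y[i] * y[i];
        sxy += x[i] * y[i];
    }

    double  cov = sxy - sx * sy / n;
    double  vx = sxx - sx * sx / n;
    double  vy = syy - sy * sy / n;

    if (vx <= 0 || vy <= 0)
        return 0.0;             /* constant column, correlation undefined */
    return cov / sqrt(vx * vy);
}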

Secondly, even if we add such simple parameters to the patch, we have to
come up with a way to apply those parameters to the estimates. The
problem is that as the parameters get simpler, it's less and less useful
to compute the stats.

Another question is whether it should support more than 2 columns ...

The only place where I think this might work is the associative rules.
It's simple to specify rules like (ZIP code implies city) and we could
even do some simple check against the data to see if it actually makes
sense (and 'disable' the rule if not).
And even this simple example has its limits - at least in Germany, ZIP
codes are not unique in rural areas, where several villages share the
same ZIP code.


I guess there are just a few examples where columns are completely
functionally dependent without any exceptions.
But of course, if the user gives this information just for optimizing
the statistics, a few exceptions don't matter.
If this information is also used for creating different execution
plans (e.g. if there is an index on column A and column B is functionally
dependent on A, one could think about using the index on A and the
dependency instead of running through the whole table to find all tuples
that match the query condition on column B), exceptions are a very
important issue.
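(A minimal sketch of such a "simple check against the data", phrased for
the ZIP -> city example and allowing for a bounded exception rate. The row
layout, the assumption that the sample is sorted by ZIP, and the threshold
are illustrative assumptions, not part of the patch:)

#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* One sampled row, reduced to the two columns of the candidate rule. */
typedef struct
{
    const char *zip;            /* determining column (A) */
    const char *city;           /* dependent column (B) */
} SampleRow;

/*
 * Returns true if "zip determines city" holds on the sample with at most
 * max_exception_frac violating rows.  The sample is assumed to be sorted
 * by zip, so each group of equal ZIP codes is contiguous.
 */
static bool
rule_holds(const SampleRow *rows, size_t n, double max_exception_frac)
{
    size_t      violations = 0;

    for (size_t i = 0; i < n;)
    {
        size_t      j = i + 1;

        /*
         * Within one ZIP group, count rows whose city differs from the
         * first city seen for that ZIP - those are the exceptions (e.g.
         * several villages sharing one rural ZIP code).
         */
        while (j < n && strcmp(rows[j].zip, rows[i].zip) == 0)
        {
            if (strcmp(rows[j].city, rows[i].city) != 0)
                violations++;
            j++;
        }
        i = j;
    }

    return (double) violations <= max_exception_frac * (double) n;
}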


But maybe I got it wrong and you have something particular in mind? Can
you give an example of how it would work?

regards
Tomas






--
Dipl.-Math. Katharina Büchse
Friedrich-Schiller-Universität Jena
Institut für Informatik
Lehrstuhl für Datenbanken und Informationssysteme
Ernst-Abbe-Platz 2
07743 Jena
Telefon 03641/946367
Webseite http://users.minet.uni-jena.de/~re89qen/





Re: [HACKERS] two dimensional statistics in Postgres

2014-11-07 Thread Katharina Büchse

Ahoj ;-)

On 06.11.2014 11:56, Tomas Vondra wrote:

Hi,

On 6 November 2014, 11:15, Katharina Büchse wrote:

Hi,

I'm a PhD student at the University of Jena, Thüringen, Germany, in the
field of databases, more precisely query optimization.
I want to implement a system in PostgreSQL that detects column
correlations and creates statistical data about correlated columns for
the optimizer. Therefore I need to store two-dimensional statistics
(especially two-dimensional histograms) in PostgreSQL.

Cool!


I had a look at the description of "WIP: multivariate statistics / proof
of concept", which looks really promising. I guess these statistics are
based on scans of the data? For my system I need both -- statistical

Yes, it's based on a sample of the data.


data based on table scans (actually, samples are enough) and statistics
based on query feedback. Query feedback (tuple counts and, speaking a
little inaccurately, the WHERE part of the query itself) needs to be
extracted, and the optimizer needs a way to decide when to use
multivariate statistics and when to use the one-dimensional ones. Oracle
in this case simply disables one-dimensional histograms if there is
already a multidimensional histogram, but this is not always useful,
especially in the case of a feedback-based histogram (which might not
cover the whole data space). I want to use both kinds of histograms

What do you mean by not covering the whole data space? I assume that when
building a feedback-based histogram, parts of the data will be filtered out
because of WHERE clauses etc. Is that what you mean? I don't see how this
could happen for regular histograms, though.
Yes, you're right. Because of the WHERE clauses, some parts of the data
might be filtered out in feedback-based histograms. This cannot happen
in regular histograms, but as I mentioned -- I would like to use both 
kinds of histograms.



because correlations might occur only in parts of the data. In this case
a histogram based on a sample of the whole table might not capture them
and wouldn't help for the part of the data the user is actually
interested in.

Yeah, there may be dependencies that are difficult to spot in the whole
dataset, but emerge once you filter to a specific subset.

Now, how would that work in practice? Initially the query needs to be
planned using regular stats (because there's no feedback yet), and then -
when we decide the estimates are way off - it may be re-planned using the
feedback. The feedback is inherently query-specific, so I'm not sure if
it's possible to reuse it for multiple queries (might be possible for
sufficiently similar ones).

Would this be done automatically for all queries / all conditions, or only
when specifically enabled (on a table, columns, ...)?
The idea is the following: I want to detect correlations with two
different algorithms, one scanning some samples of the data, the other
analyzing query feedback. If both decide for a column combination that
it's correlated, then a regular histogram should be built for this
combination. If only the scanning algorithm says "correlated", then it
means that there is some correlation, but it is not interesting for the
user right now. So I would only leave a note for the optimizer that
there is correlation, and if the user's interest changes and query
results differ a lot from the estimates in the plan, then again --
regular histogram. If only the feedback algorithm decides that the
columns are correlated, a histogram based on query feedback is the most
useful choice to support the work of the optimizer.
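(A minimal sketch of that three-way decision, just to make the cases
explicit; the detector flags and the "note for the optimizer" state are
assumptions for illustration, not existing code:)

#include <stdbool.h>

typedef enum
{
    STATS_NONE,                 /* neither detector found correlation */
    STATS_NOTE_ONLY,            /* correlation exists but is not queried yet */
    STATS_REGULAR_HISTOGRAM,    /* build a sample-based multivariate histogram */
    STATS_FEEDBACK_HISTOGRAM    /* build a histogram from query feedback */
} StatsDecision;

/*
 * scan_correlated     - result of the sample-scan detector
 * feedback_correlated - result of the query-feedback detector
 */
static StatsDecision
decide_stats(bool scan_correlated, bool feedback_correlated)
{
    if (scan_correlated && feedback_correlated)
        return STATS_REGULAR_HISTOGRAM;
    if (scan_correlated)
        return STATS_NOTE_ONLY;     /* revisit if estimates turn out bad */
    if (feedback_correlated)
        return STATS_FEEDBACK_HISTOGRAM;
    return STATS_NONE;
}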



There are special data structures for storing multidimensional
histograms based on feedback, and I already tried to implement one of
these in C. In the case of two dimensions they are of course not for
free (one-dimensional ones would be much cheaper), but based on the
principle of maximum entropy they deliver really good results. I decided
on only two dimensions because in this case the cost/benefit ratio of
searching for correlation is the best (here I'm relying on

I think hardcoding the two-dimensions limit is wrong. I understand a higher
number of dimensions means more expensive operations, but if the user can
influence it, I believe it's OK.
I don't know whether the user has to decide if the statistical data is
based on feedback or on data scans. I guess it's enough if the
higher-dimensional histograms are based on data scans.
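(For context on the feedback-based structures mentioned above: one
standard way to make bucket frequencies consistent with observed query
cardinalities is iterative proportional fitting, which, for consistent
constraints and starting from a uniform bucket distribution, converges
towards the maximum-entropy solution. A minimal sketch, with the bucket
layout and feedback records assumed for illustration; this is not the
data structure referred to above:)

#define MAX_BUCKET_REFS 64

/*
 * One piece of query feedback: the query's predicate covered the listed
 * buckets and the executor observed 'observed' matching rows.
 */
typedef struct
{
    int     nbuckets;
    int     bucket[MAX_BUCKET_REFS];
    double  observed;
} Feedback;

/*
 * Iterative proportional fitting: repeatedly rescale the frequencies of
 * the buckets covered by each feedback record so that their sum matches
 * the observed cardinality.  With overlapping records this converges
 * towards a histogram consistent with all of them.
 */
static void
fit_feedback(double *freq, const Feedback *fb, int nfb, int iterations)
{
    for (int it = 0; it < iterations; it++)
    {
        for (int c = 0; c < nfb; c++)
        {
            double      sum = 0.0;

            for (int k = 0; k < fb[c].nbuckets; k++)
                sum += freq[fb[c].bucket[k]];

            if (sum <= 0.0)
                continue;       /* nothing to scale */

            double      factor = fb[c].observed / sum;

            for (int k = 0; k < fb[c].nbuckets; k++)
                freq[fb[c].bucket[k]] *= factor;
        }
    }
}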


Also, is there any particular reason not to support other kinds of
stats (say, MCV lists)? In the end it's just a different way to
approximate the distribution, and it may be way cheaper than histograms.
The reason actually is just that 1) I have only limited time and cannot
cover every possible way of supporting the optimizer when there is
correlation, and 2) there are good papers about feedback-based
histograms :-D
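(To illustrate why an MCV list is the cheap end of that spectrum: a
two-dimensional MCV list can be built from a sample with little more than
counting value pairs. A minimal sketch over two integer columns; a real
implementation would of course need proper datatypes and hashing:)

#include <stddef.h>
#include <stdlib.h>

#define MAX_PAIRS 1024          /* sketch only; a real implementation would hash */

typedef struct
{
    int     a;                  /* value from the first column */
    int     b;                  /* value from the second column */
    int     count;              /* occurrences of the (a, b) pair in the sample */
} PairCount;

static int
cmp_count_desc(const void *p1, const void *p2)
{
    const PairCount *x = p1;
    const PairCount *y = p2;

    if (x->count != y->count)
        return (y->count > x->count) ? 1 : -1;
    return 0;
}

/*
 * Count (a, b) pairs from two sampled integer columns and sort them by
 * frequency; the first k entries then form a two-dimensional MCV list.
 * The caller provides a 'pairs' array of MAX_PAIRS entries; pairs beyond
 * that many distinct values are simply dropped in this sketch.
 */
static int
build_mcv_pairs(const int *cola, const int *colb, size_t nrows,
                PairCount *pairs)
{
    int     npairs = 0;

    for (size_t i = 0; i < nrows; i++)
    {
        int     j;

        for (j = 0; j < npairs; j++)
            if (pairs[j].a == cola[i] && pairs[j].b == colb[i])
                break;

        if (j == npairs)
        {
            if (npairs >= MAX_PAIRS)
                continue;       /* table full: drop this pair */
            pairs[npairs++] = (PairCount) {cola[i], colb[i], 0};
        }
        pairs[j].count++;
    }

    qsort(pairs, npairs, sizeof(PairCount), cmp_count_desc);
    return npairs;
}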



tests that were made in DB2 within a project called CORDS which detects
correlations even between