Re: [HACKERS] Patch Review: Collect frequency statistics and selectivity estimation for arrays

2011-07-15 Thread Alexander Korotkov
Hi!

Thank you for review. I've few questions about it.

On Fri, Jul 15, 2011 at 2:13 AM, Nathan Boley npbo...@gmail.com wrote:

 First, it makes me uncomfortable that you are using the MCV and histogram
 slot
 kinds in a way that is very different from other data types.

 I realize that tsvector uses MCV in the same way that you do but:

 1) I don't like that very much either.
 2) TS vector is different in that equality ( in the btree sense )
doesn't make sense, whereas it does for arrays.

 Using the histogram slot for the array lengths is also very surprising to
 me.

 Why not just use a new STA_KIND? It's not like we are running out of
 room, and this will be the second 'container' type that splits the
 container
 and stores stats about the elements.

Thus, do you think we should collect both btree and frequency/length
statistics for arrays?


 1) In calc_distr you go to some lengths to avoid round off errors. Since it
 is
   certainly just the order of the estimate that matters, why not just
   perform the calculation in log space?

It seems to me that I didn't anything to avoid round off errors there...

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Patch Review: Collect frequency statistics and selectivity estimation for arrays

2011-07-15 Thread Alexander Korotkov
On Fri, Jul 15, 2011 at 2:13 AM, Nathan Boley npbo...@gmail.com wrote:

 First, it makes me uncomfortable that you are using the MCV and histogram
 slot
 kinds in a way that is very different from other data types.

 I realize that tsvector uses MCV in the same way that you do but:

 1) I don't like that very much either.
 2) TS vector is different in that equality ( in the btree sense )
doesn't make sense, whereas it does for arrays.

Actually, not MCV slot is used but MCELEM. It is a different slot. ps_stats
view map both into same fileds. Surely, non-standard use of histogram slot
should be avoided.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Patch Review: Collect frequency statistics and selectivity estimation for arrays

2011-07-15 Thread Nathan Boley
 Actually, not MCV slot is used but MCELEM. It is a different slot. ps_stats
 view map both into same fileds.

Yes, you're right. I am sorry about the error.

 Surely, non-standard use of histogram slot
 should be avoided.

Agreed.

Best,
Nathan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Patch Review: Collect frequency statistics and selectivity estimation for arrays

2011-07-14 Thread Nathan Boley
Review of patch: https://commitfest.postgresql.org/action/patch_view?id=539

I glanced through the code and played around with the feature and,
in general, it looks pretty good. But I have a some comments/questions.

Design:

First, it makes me uncomfortable that you are using the MCV and histogram slot
kinds in a way that is very different from other data types.

I realize that tsvector uses MCV in the same way that you do but:

1) I don't like that very much either.
2) TS vector is different in that equality ( in the btree sense )
doesn't make sense, whereas it does for arrays.

Using the histogram slot for the array lengths is also very surprising to me.

Why not just use a new STA_KIND? It's not like we are running out of
room, and this will be the second 'container' type that splits the container
and stores stats about the elements.


Second, I'm fairly unclear about what the actual underlying statistical
method is and what assumptions it makes. Before I can really review
the method, I will probably need to see some documentation, as I say below.
Do you have any tests/simulations that explore the estimation procedure
that I can look at? When will it fail? In what cases does it improve
estimation?

Documentation:

Seems a bit lacking - especially if you keep the non standard usage of
mcv/histograms. Also, it would be really nice to see some update to
the row-estimation-examples page ( in chapter 55 ).

The code comments are good in general, but there are not enough high level
comments . It would be really nice to have a comment that gives an overview
of what each of the following functions are supposed to do:

mcelem_array_selec
mcelem_array_contain_overlap_selec

and especially

mcelem_array_contained_selec

Also, in the compute_array_stats you reference compute_tsvector_stats for
the algorithm, but I think that it would be smarter to either copy the
relevant portions of the comment over or to reference a published document.
If compute_tsvector_stats changes, I doubt that anyone will remember to
update the comment.

Finally, it would be really nice to see, explicitly, and in
mathematical notation

1) The data that is being collected and summarized. ( the statistic )
2) The estimate being derived from the statistic ( ie the selectivity )
i) Any assumptions being made ( ie independence of elements within
an array )

For instance, the only note I could find that actually addressed the
estimator and
assumptions underneath it was

+* mult * dist[i] / mcelem_dist[i] gives us probability 
probability
+* of qual matching from assumption of independent 
element occurence
+* with condition that length = i.

and that's pretty cryptic. That should be expanded upon and placed
more prominently.

A couple nit picks:

1) In calc_distr you go to some lengths to avoid round off errors. Since it is
   certainly just the order of the estimate that matters, why not just
   perform the calculation in log space?

2) compute_array_stats is really long. Could you break it up? ( I know that
   the other analyze functions are the same way, but why continue in
that vein? )

Best,
Nathan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers