Re: [HACKERS] Cross-column statistics revisited

2008-10-17 Thread Nathan Boley
Right now our histogram values are really quantiles; the statistics_target T for a column determines a number of quantiles we'll keep track of, and we grab values from into an ordered list L so that approximately 1/T of the entries in that column fall between values L[n] and L[n+1]. I'm

Re: [HACKERS] Cross-column statistics revisited

2008-10-17 Thread Nathan Boley
I'm still working my way around the math, but copulas sound better than anything else I've been playing with. I think the easiest way to think of them is, in 2-D finite spaces, they are just a plot of the order statistics against one another. Feel free to mail me off list if you have any math

Re: [HACKERS] Cross-column statistics revisited

2008-10-19 Thread Nathan Boley
I still need to go through backend/utils/adt/selfuncs.c to figure out exactly how we use the one-dimensional values. Here's a page that helped me figure all this out. http://www.postgresql.org/docs/8.1/static/planner-stats-details.html 2) Do we want to fold the MCV's into the dependence

Re: [HACKERS] Simple postgresql.conf wizard

2008-12-05 Thread Nathan Boley
Thanks for putting out pgtune - it's a sorely needed tool. I had a chance to look over the source code and have a few comments, mostly about python specific coding conventions. - The windows specific try block ( line 16 ) raises a ValueError vs ImportError on my debian machine. Maybe it would be

Re: [HACKERS] Python 3.1 support

2009-11-18 Thread Nathan Boley
Again, I'm only one user.  But so far I haven't seen anyone else speak up here, and clearly accepting this for inclusion will need nontrivial convincing. Well, FWIW, I am excited about better type integration. Also, I am a little skeptical about this patch. I am sorry if this has already been

Re: [HACKERS] Python 3.1 support

2009-11-18 Thread Nathan Boley
Here's the patch to support Python =3.1 with PL/Python.  The compatibility code is mostly in line with the usual 2-3 C porting practice and is documented inline. I took a cursory look at this patch and, while the logic seems sound and roughly in line with the suggested python porting

Re: [HACKERS] Range types

2009-12-14 Thread Nathan Boley
Because intervals (mathematical not SQL) can be open or closed at each end point we need to know what the next an previous value would be at the specified granularity. And while you can do some operations without knowing this, there are many you can't. For instance you could not tell whether

Re: [HACKERS] Range types

2009-12-14 Thread Nathan Boley
Right, I don't think strings are any more or less countable than integers. (and yes, it's a bit OT). Well, if I remember my algebra, I think the issue is that integers are locally finite whereas strings are not ( assuming standard orderings, of course ). -Nathan -- Sent via pgsql-hackers

Re: [HACKERS] plpython3

2010-02-01 Thread Nathan Boley
On the basis of all of the foregoing, I don't think we can consider this patch further for this CommitFest and will update commitfest.postgresql.org accordingly. FWIW, I am very excited about this patch and would be happy to review it but have been very busy over the past month. If I can

Re: [HACKERS] plpython3

2010-02-01 Thread Nathan Boley
I think it would be great for you to review it... I doubt that will cause it to get committed for 9.0, but my doubt is no reason for you to hold off reviewing it. I assumed so, but the pretense of a chance will probably help to motivate me :-) I'll have something by Thursday, and then

Re: [HACKERS] extended operator classes vs. type interfaces

2010-04-09 Thread Nathan Boley
The advantage of specifying a + and a - in the type interface is that the unit definition can then be specified as part of the type declaration itself.  So you can do: CREATE TYPE ts_sec AS RANGE OVER timestamp (UNIT = '1s'); CREATE TYPE ts_min AS RANGE OVER timestamp (UNIT = '1m'); All of

Re: [HACKERS] operator exclusion constraints

2009-11-01 Thread Nathan Boley
After reading the docs in the patch I don't believe you're going to all this trouble to ensure two circles don't overlap. Can you give some better examples of what you're trying to achieve and why anyone else would care? (I'm busy, so are others). Non overlapping time intervals is one use

[HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-08 Thread Nathan Boley
Currently eqsel assumes that, except for the values stored as mcv's, the number of times that a value appears in a table column is independent of it's value. Unfortunately, this often yields inappropriate selectivity estimates and frequently leads to inappropriate plans. As an example, consider

Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-09 Thread Nathan Boley
Your argument seems to consider only columns having a normal distribution. My example was based upon normally distributed data because people usually know what they are and they are reasonably common. How badly does it fall apart for non-normal distributions? This should work to the extent

Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Nathan Boley
One more problem with low ndistinct values is that the condition might very well hit no rows at all. But Idea 1 will largely overestimate the number of hits. Thats a good point, but I don't see a clear solution. Maybe we could look at past queries and keep track of how often they return

Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Nathan Boley
One more problem with low ndistinct values is that the condition might very well hit no rows at all. But Idea 1 will largely overestimate the number of hits. Thats a good point, but I don't see a clear solution. Maybe we could I think that MCVs are the solution, right? Only if

Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Nathan Boley
If we query on values that aren't in the table, the planner will always overestimate the expected number of returned rows because it ( implicitly ) assumes that every query will return at least 1 record. That's intentional and should not be changed. Why? What if ( somehow ) we knew that

Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-10 Thread Nathan Boley
Why? What if ( somehow ) we knew that there was a 90% chance that query would return an empty result set on a big table with 20 non-mcv distinct values. Currently the planner would always choose a seq scan, where an index scan might be better. (1) On what grounds do you assert the above?

Re: [HACKERS] Proposal - improve eqsel estimates by including histogram bucket numdistinct statistics

2008-06-12 Thread Nathan Boley
Assuming that the threshold for switching to an indexscan is somewhere around selectivity 0.005 (I am not certain offhand, but it's in that general area), this cannot possibly require more than 200 MCV slots, and for most data distributions it'd be a whole lot less. Thats a really good

Re: [HACKERS] IN, BETWEEN, spec compliance, and odd operator names

2008-08-25 Thread Nathan Boley
Yes, but always in relation to operator classes, so from BTrees opclass or such, which I refered to as the context of index searches, as I don't really see any theorical need for opclass if it's not for indexing. IIRC analyze uses the btree op class to build the selectivity histogram. -Nathan

Re: [HACKERS] benchmarking the query planner

2008-12-11 Thread Nathan Boley
One more direction could be implementing MCV for range of values (group values and interpolate in between). Consider statistics on timestamp column that says that for 2008-December there are as many X rows, for 2008-November as many as Y, etc. That could be used for rather accurate

Re: [HACKERS] benchmarking the query planner

2008-12-11 Thread Nathan Boley
What is the specific difference between what you are talking about and what scalarineqsel already implements? Hmm... Northing new. Feel sorry for bothering you. I did not realize histograms are implemented. Well, ISTM there is a profound difference. For scalarineqsel we care about the total

Re: [HACKERS] benchmarking the query planner

2008-12-11 Thread Nathan Boley
Thanks for the response. Well, ISTM there is a profound difference. For scalarineqsel we care about the total number of values in a bucket. For eqsel we care about the total number of *distinct* values in each bucket Really? Well, we would obviously also care about the total number of

Re: [HACKERS] benchmarking the query planner

2008-12-11 Thread Nathan Boley
Isn't a selectivity estimate of x = v as ( the number of values in v's histogram bucket ) / ( number of distinct values in v's histogram bucket ) pretty rational? Thats currently what we do for non-mcv values, except that we look at ndistinct over the whole table instead of individual

Re: [HACKERS] hist boundary duplicates bug in head and 8.3

2009-01-06 Thread Nathan Boley
For heavy tailed distributions, it is possible for analyze to duplicate histogram boundaries. I don't think this is a bug. hmmm... Well, I assumed it was a bug from a comment in analyze. From ( near ) line 2130 in analyze.c * least 2 instances in the sample. Also, we won't suppress values

Re: [HACKERS] hist boundary duplicates bug in head and 8.3

2009-01-07 Thread Nathan Boley
Surely the most important point in the OP was that ineqsel does not correctly binary search in the presence of duplicates. It would have been if I were correct :-( . Looking at it again, that was from a bug in my code. Thanks for your time, and sorry about the noise. -Nathan -- Sent via

Re: [HACKERS] Multi-Dimensional Histograms

2009-06-29 Thread Nathan Boley
For things like PostGIS, which will want to index in 4 dimensions (x, y, z, t), we might want to have multi-dimensional selectivity histograms and some way to use same. Another use case is cross column statistics. Anybody here qualified to check out this paper on the subject, please speak

Re: [HACKERS] Multi-Dimensional Histograms

2009-06-29 Thread Nathan Boley
Finally, this creates the partition but ( AFAICT ) it doesn't describe a method for locating the histogram estimate given a point ( although that doesn't seem too difficult ). Is that not difficult, in terms of the math that needs doing, or not difficult, in terms of how well PostgreSQL is

Re: [HACKERS] Multi-Dimensional Histograms

2009-06-29 Thread Nathan Boley
On Mon, Jun 29, 2009 at 3:43 PM, Tom Lanet...@sss.pgh.pa.us wrote: David Fetter da...@fetter.org writes: On Mon, Jun 29, 2009 at 01:28:01PM -0700, Nathan Boley wrote: ... They dismiss singular value decomposition and the discrete wavelet transform as being too parametric ( which is silly

Re: [HACKERS] Multi-Dimensional Histograms

2009-06-30 Thread Nathan Boley
I'm finding myself unable to follow all the terminology on this thead.  What's dimension reduction?  What's PCA? ( Sorry for the jargon - thanks Josh ) It feels like what you might need is statistics for colB (MCVs and/or a histogram) for certain particular values of colA. Certainly - this

Re: [HACKERS] some longer, larger pgbench tests with various performance-related patches

2012-01-25 Thread Nathan Boley
I actually don't know much about the I/O subsystem, but, no, WAL is not separated from data.  I believe $PGDATA is on a SAN, but I don't know anything about its characteristics. The computer is here: http://www.supermicro.nl/Aplus/system/2U/2042/AS-2042G-6RF.cfm $PGDATA is on a 5 disk SATA

Re: [HACKERS] Collect frequency statistics for arrays

2012-02-29 Thread Nathan Boley
On Wed, Feb 29, 2012 at 12:39 PM, Tom Lane t...@sss.pgh.pa.us wrote: Alexander Korotkov aekorot...@gmail.com writes: On Mon, Jan 23, 2012 at 7:58 PM, Noah Misch n...@leadboat.com wrote: I've attached a new version that includes the UINT64_FMT fix, some edits of your newest comments, and a

Re: [HACKERS] Collect frequency statistics for arrays

2012-02-29 Thread Nathan Boley
On Wed, Feb 29, 2012 at 2:43 PM, Tom Lane t...@sss.pgh.pa.us wrote: Nathan Boley npbo...@gmail.com writes: On Wed, Feb 29, 2012 at 12:39 PM, Tom Lane t...@sss.pgh.pa.us wrote: I am starting to look at this patch now.  I'm wondering exactly why the decision was made to continue storing btree

Re: [HACKERS] Collect frequency statistics for arrays

2012-03-01 Thread Nathan Boley
What about MCV's? Will those be removed as well? Sure.  Those seem even less useful. Ya, this will destroy the performance of several queries without some heavy tweaking. Maybe this is bad design, but I've gotten in the habit of storing sequences as arrays and I commonly join on them. I

Re: [HACKERS] Collect frequency statistics for arrays

2012-03-01 Thread Nathan Boley
[ sorry Tom, reply all this time... ] What do you mean by storing sequences as arrays? So, a simple example is, for transcripts ( sequences of DNA that are turned into proteins ), we store each of the connected components as an array of the form: exon_type in [1,6] splice_type = [1,3] and

Re: [HACKERS] proposal : cross-column stats

2010-12-12 Thread Nathan Boley
The proposed solution is based on contingency tables, built for selected groups of columns (not for each possible group). And the contingency table gives you the ability to estimate the probabilities needed to compute the selectivity. Or am I missing something? Well, I'm not real familiar

Re: [HACKERS] Why don't we accept exponential format for integers?

2010-12-17 Thread Nathan Boley
print int(1e+01) 10 That isn't building an integer: it is creating a float and casting to an int. try: int( 1e100 ) Best, Nathan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] estimating # of distinct values

2011-01-19 Thread Nathan Boley
On Wed, Jan 19, 2011 at 2:13 PM, Tomas Vondra t...@fuzzy.cz wrote: Dne 19.1.2011 20:25, Robert Haas napsal(a): On Tue, Jan 18, 2011 at 5:16 PM, Tomas Vondra t...@fuzzy.cz wrote: Yes, I was aware of this problem (amount of memory consumed with lots of updates), and I kind of hoped someone will

Re: [HACKERS] estimating # of distinct values

2011-01-19 Thread Nathan Boley
If you think about it, it's a bit ridiculous to look at the whole table *just* to estimate ndistinct - if we go that far why dont we just store the full distribution and be done with it? - the best you could do is to average the individual probabilities which gives ... well, 1/ndistinct.

Re: [HACKERS] estimating # of distinct values

2011-01-19 Thread Nathan Boley
to get it committed ( check the archives for cross column stat threads - there are a lot ). Best, Nathan Boley -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] estimating # of distinct values

2011-01-19 Thread Nathan Boley
On Wed, Jan 19, 2011 at 6:35 PM, Florian Pflug f...@phlo.org wrote: On Jan20, 2011, at 02:41 , Nathan Boley wrote: If you think about it, it's a bit ridiculous to look at the whole table *just* to estimate ndistinct - if we go that far why dont we just store the full distribution and be done

Re: [HACKERS] Measure Theoretic Data Types in Postgresql

2012-10-12 Thread Nathan Boley
However, by realizing that the bounds on the ranges have a linear ordering one can speed this up to 0(m) using windowing functions on common table expressions. So what I am proposing is formalizing this optimization into a class of data types, that will hide the implementation details.

[HACKERS] Patch Review: Collect frequency statistics and selectivity estimation for arrays

2011-07-14 Thread Nathan Boley
Review of patch: https://commitfest.postgresql.org/action/patch_view?id=539 I glanced through the code and played around with the feature and, in general, it looks pretty good. But I have a some comments/questions. Design: First, it makes me uncomfortable that you are using the MCV and

Re: [HACKERS] Patch Review: Collect frequency statistics and selectivity estimation for arrays

2011-07-15 Thread Nathan Boley
Actually, not MCV slot is used but MCELEM. It is a different slot. ps_stats view map both into same fileds. Yes, you're right. I am sorry about the error. Surely, non-standard use of histogram slot should be avoided. Agreed. Best, Nathan -- Sent via pgsql-hackers mailing list

Re: [HACKERS] default_statistics_target WAS: max_wal_senders must die

2010-10-20 Thread Nathan Boley
Robert explained why having more MCVs might be useful because we use the frequency of the least common MCV as an upper bound on the frequency of any value in the MCV. Where is that being used? The only non-MCV frequency estimate that I recall seeing is ( nrows - n_ndistinct_rows )/ndistinct.

Re: [HACKERS] default_statistics_target WAS: max_wal_senders must die

2010-10-20 Thread Nathan Boley
That one's used, too, but the other is used as an upper bound. n_distinct tends to come out too small on large tables, so that formula is prone to overestimation.  Actually, both formulas are prone to overestimation. Right - thanks. When this happens depends on the values of a whole

Re: [HACKERS] WIP: collect frequency statistics for arrays

2011-10-15 Thread Nathan Boley
Looking now, I see that Alexander wasn't Cc'd on the review, so it's possible he missed the message? We've corresponded off list and have discussed my review at some length. Alex submitted an updated patch on Sep 22 to me personally ( although not to the list? Alex? ), with the promise of a

Re: [HACKERS] Collect frequency statistics for arrays

2011-11-15 Thread Nathan Boley
Rebased with head. FYI, I've added myself as the reviewer for the current commitfest. Best, Nathan Boley -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: collect frequency statistics for arrays

2011-11-28 Thread Nathan Boley
Version of patch with few more comments and some fixes. Where are we on the idea of better statistics for arrays? I need to finish reviewing the patch: I'll try to send in something tomorrow night. So far it looks good. Best, Nathan -- Sent via pgsql-hackers mailing list

Re: [HACKERS] Range Types: empty ranges

2011-02-11 Thread Nathan Boley
FWIW, a very informal survey of probabilists didn't yield any reason for trying to put an order on the empty set ( unless the metric was cardinality or other equivalence relation ). I think the problem here is that the idea of union and intersection forming a ring over sets is being conflated

Re: [HACKERS] WIP: cross column correlation ...

2011-02-22 Thread Nathan Boley
Personally, I think the first thing we ought to do is add a real, bona fide planner hint to override the selectivity calculation manually, maybe something like this: WHERE (x 5 AND y = 1) SELECTIVITY (0.1); If you're going to go that far, why not just collect statistics on that specific