Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-31 Thread Jesper Krogh
On 2010-05-30 20:02, Jan Urbański wrote: Here's a patch against recent git, but should apply to 8.4 sources as well. It would be interesting to measure the memory and time needed to analyse the table after applying it, because we will be now using a lot bigger bucket size and I haven't done any

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-31 Thread Tom Lane
Jesper Krogh jes...@krogh.cc writes: Just a small follow up. I tried out the patch (or actually a fresh git checkout) and it now gives very accurate results for both upper and lower end of the MCE-histogram with a lower cutoff that doesn't approach 2. Good. How much did the ANALYZE time

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-31 Thread Jesper Krogh
On 2010-05-31 20:38, Tom Lane wrote: Jesper Krogh jes...@krogh.cc writes: Just a small follow up. I tried out the patch (or actually a fresh git checkout) and it now gives very accurate results for both upper and lower end of the MCE-histogram with a lower cutoff that doesn't approach 2.

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-30 Thread Jesper Krogh
On 2010-05-29 15:56, Jan Urbański wrote: On 29/05/10 12:34, Jesper Krogh wrote: On 2010-05-28 23:47, Jan Urbański wrote: On 28/05/10 22:22, Tom Lane wrote: Now I tried to substitute some numbers there, and so assuming the English language has ~1e6 words H(W) is around 6.5. Let's

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-30 Thread Tom Lane
Jesper Krogh jes...@krogh.cc writes: On 2010-05-29 15:56, Jan Urbański wrote: AFAIK statistics for everything other than tsvectors are built based on the values of whole rows. Wouldn't it make sense to treat array types like the tsvectors? Yeah, I have a personal TODO item to look into that

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-30 Thread Jan Urbański
Jesper Krogh jes...@krogh.cc writes: On 2010-05-29 15:56, Jan Urbański wrote: AFAIK statistics for everything other than tsvectors are built based on the values of whole rows. Wouldn't it make sense to treat array types like the tsvectors? Yeah, I have a personal TODO item to look

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-30 Thread Tom Lane
Jan Urbański wulc...@wulczer.org writes: I think the only relevance of stopwords to the current problem is that *if* stopwords have been removed, we would see a Zipfian distribution with the first few entries removed, and I'm not sure if it's still really Zipfian afterwards.

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-30 Thread Jan Urbański
On 30/05/10 09:08, Jesper Krogh wrote: On 2010-05-29 15:56, Jan Urbański wrote: On 29/05/10 12:34, Jesper Krogh wrote: I can fairly easy try out patches or do other kind of testing. I'll try to come up with a patch for you to try and fiddle with these values before Monday. Here's a

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-30 Thread Tom Lane
Jan Urbański wulc...@wulczer.org writes: Here's a patch against recent git, but should apply to 8.4 sources as well. It would be interesting to measure the memory and time needed to analyse the table after applying it, because we will be now using a lot bigger bucket size

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-30 Thread Jan Urbański
On 31/05/10 00:07, Tom Lane wrote: Jan Urbański wulc...@wulczer.org writes: I committed the attached revised version of the patch. Revisions are mostly minor but I did make two substantive changes: * The patch changed the target number of mcelems from 10 *

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-29 Thread Jesper Krogh
On 2010-05-28 23:47, Jan Urbański wrote: On 28/05/10 22:22, Tom Lane wrote: The idea that I was toying with is to assume a Zipfian distribution of the input (with some reasonable parameter), and use that to estimate what the frequency of the K'th element will be, where K is the target

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-29 Thread Jesper Krogh
On 2010-05-28 04:47, Tom Lane wrote: Cranking up the stats target actually makes it worse, not better, since low-frequency items are then more likely to get into the MCV list I should have been more precise in the wording. Cranking up the stats target gave me overall a better plan, but that

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-29 Thread Jan Urbański
On 29/05/10 12:34, Jesper Krogh wrote: On 2010-05-28 23:47, Jan Urbański wrote: On 28/05/10 22:22, Tom Lane wrote: Now I tried to substitute some numbers there, and so assuming the English language has ~1e6 words H(W) is around 6.5. Let's assume the statistics target to be 100. I chose s as

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-29 Thread Tom Lane
Jan Urbański wulc...@wulczer.org writes: Now I tried to substitute some numbers there, and so assuming the English language has ~1e6 words H(W) is around 6.5. Let's assume the statistics target to be 100. I chose s as 1/(st + 10)*H(W) because the top 10 English words will
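Jan's choice of s quoted above can be made concrete with a small sketch. I read the formula as s = 1/((st + 10) * H(W)); the precedence in the snippet is ambiguous, so that reading is my assumption, as is the ln(n) + gamma approximation for the harmonic number. The values st = 100 and W = 1e6 are the ones from the thread.

```python
# Sketch of Jan's proposed pruning threshold, read as s = 1/((st + 10) * H(W)).
# The precedence and the harmonic-number approximation are my assumptions.
import math

EULER_GAMMA = 0.5772156649

def harmonic(n: int) -> float:
    """Approximate the n'th harmonic number H(n) as ln(n) + gamma."""
    return math.log(n) + EULER_GAMMA

st = 100                 # statistics target from the thread's example
W = 1_000_000            # assumed English vocabulary size
s = 1 / ((st + 10) * harmonic(W))
print(f"s = {s:.6f}")    # the frequency below which lexemes would be pruned
```

Note that with this approximation H(1e6) comes out near 14.4 rather than the 6.5 quoted in the snippet, which is consistent with Tom Lane's later objection that something was off in the math.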

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-29 Thread Tom Lane
Jan Urbański wulc...@wulczer.org writes: Hm, I am now thinking that maybe this theory is flawed, because tsvectors contain only *unique* words, and Zipf's law is talking about words in documents in general. Normally a word like "the" would appear lots of times in a document,

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-29 Thread Jan Urbański
On 29/05/10 17:09, Tom Lane wrote: Jan Urbański wulc...@wulczer.org writes: Now I tried to substitute some numbers there, and so assuming the English language has ~1e6 words H(W) is around 6.5. Let's assume the statistics target to be 100. I chose s as 1/(st + 10)*H(W)

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-29 Thread Tom Lane
Jan Urbański wulc...@wulczer.org writes: On 29/05/10 17:09, Tom Lane wrote: There is definitely something wrong with your math there. It's not possible for the 100'th most common word to have a frequency as high as 0.06 --- the ones above it presumably have larger

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-29 Thread Jan Urbański
On 29/05/10 17:34, Tom Lane wrote: Jan Urbański wulc...@wulczer.org writes: On 29/05/10 17:09, Tom Lane wrote: There is definitely something wrong with your math there. It's not possible for the 100'th most common word to have a frequency as high as 0.06 --- the ones

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-29 Thread Tom Lane
Jan Urbański wulc...@wulczer.org writes: [ e of ] s/2 or s/3 look reasonable. The examples in the LC paper seem to all use e = s/10. Note the stated assumption e << s. So, should I just write a patch that sets the bucket width and pruning count using 0.07 as the assumed

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-28 Thread Jan Urbański
On 28/05/10 04:47, Tom Lane wrote: Jan Urbański wulc...@wulczer.org writes: On 19/05/10 21:01, Jesper Krogh wrote: In practice, just cranking the statistics estimate up high enough seems to solve the problem, but doesn't there seem to be something wrong in how the

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-28 Thread Jan Urbański
On 28/05/10 04:47, Tom Lane wrote: I re-scanned that paper and realized that there is indeed something wrong with the way we are doing it. The paper says (last sentence in the definition of the algorithm, section 4.2): When a user requests a list of items with threshold s, we output
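The output rule Jan quotes from the Lossy Counting paper can be sketched generically: after processing N elements with error bound e, return every tracked item whose counted frequency f satisfies f >= (s - e) * N. This is a sketch of the paper's rule, not PostgreSQL's ts_typanalyze code; the function name and the sample data are mine.

```python
# Lossy Counting output rule (paper, section 4.2): with N elements seen,
# threshold s, and error bound e, emit tracked items with f >= (s - e) * N.
def lossy_counting_output(tracked, n_seen, s, e):
    """tracked: dict mapping item -> counted frequency f."""
    return [item for item, f in tracked.items() if f >= (s - e) * n_seen]

# Hypothetical counts: "the" clears the (s - e) * N = 450 bar, "zipf" does not.
print(lossy_counting_output({"the": 950, "zipf": 40}, 10_000, s=0.05, e=0.005))
```

The thread's complaint is precisely that the pre-patch code did not apply this final thresholding step when deciding which lexemes to store.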

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-28 Thread Tom Lane
Jan Urbański wulc...@wulczer.org writes: We follow the algorithm as written; the trouble starts when we want to output the result. The paper says which items from the D structure should be returned when the user asks for items that have frequencies higher than a threshold

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-28 Thread Jan Urbański
On 28/05/10 22:22, Tom Lane wrote: The idea that I was toying with is to assume a Zipfian distribution of the input (with some reasonable parameter), and use that to estimate what the frequency of the K'th element will be, where K is the target number of MCV entries or perhaps a bit more.
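The idea Tom floats above can be sketched numerically: under a Zipfian distribution with exponent 1, the k'th most common of N words has relative frequency roughly 1 / (k * H(N)). The function names and the values N = 1e6 and K = 100 are illustrative, taken from the thread's example rather than from any PostgreSQL code.

```python
# Estimated frequency of the K'th most common word under Zipf's law,
# f(k) = 1 / (k * H(N)), used here to guess a sensible MCE cutoff.
import math

def harmonic(n: int) -> float:
    """Approximate H(n) = sum(1/i for i in 1..n) as ln(n) + gamma."""
    return math.log(n) + 0.5772156649

def zipf_freq(k: int, n_words: int) -> float:
    """Estimated relative frequency of the k'th most common word."""
    return 1.0 / (k * harmonic(n_words))

cutoff = zipf_freq(100, 1_000_000)
print(f"estimated frequency of the 100th word: {cutoff:.6f}")
```

This gives a value on the order of 7e-4 for the 100th word, far below the 0.06 figure Tom objects to elsewhere in the thread.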

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-27 Thread Tom Lane
Jan Urbański wulc...@wulczer.org writes: On 19/05/10 21:01, Jesper Krogh wrote: In practice, just cranking the statistics estimate up high enough seems to solve the problem, but doesn't there seem to be something wrong in how the statistics are collected? The algorithm to
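For reference, the Lossy Counting maintenance loop the thread keeps returning to can be sketched generically: the stream is split into buckets of width w = ceil(1/e), each tracked item carries a count f and a maximum-error term delta, and at every bucket boundary entries with f + delta <= current bucket number are pruned. This is a sketch of the paper's algorithm under my own naming, not the ts_typanalyze implementation.

```python
# Generic Lossy Counting sketch: bucket width w = ceil(1/e); entries are
# pruned at bucket boundaries when f + delta <= current bucket number.
import math

def lossy_count(stream, e):
    w = math.ceil(1 / e)          # bucket width
    D = {}                        # item -> (f, delta)
    for n, item in enumerate(stream, start=1):
        b_current = math.ceil(n / w)
        if item in D:
            f, delta = D[item]
            D[item] = (f + 1, delta)
        else:
            D[item] = (1, b_current - 1)
        if n % w == 0:            # bucket boundary: prune rare entries
            D = {k: (f, d) for k, (f, d) in D.items() if f + d > b_current}
    return D

# A rare item arriving just before a boundary is pruned immediately.
print(lossy_count(["a"] * 9 + ["x"], 0.1))
```

The accuracy questions in the thread are about what happens *after* this loop: which of the surviving entries should actually be reported, and with what frequency.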

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-25 Thread Alvaro Herrera
Excerpts from Jesper Krogh's message of Wed May 19 15:01:18 -0400 2010: But the distribution is very flat at the end, the last 128 values are exactly 1.00189e-05 which means that any term sitting outside the array would get an estimate of 1.00189e-05 * 350174 / 2 = 1.75 ~ 2 rows I don't
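The arithmetic Alvaro quotes is easy to reproduce: a lexeme falling outside the most-common-elements array is assumed to have half the minimum stored frequency, and the row estimate is that frequency times the table's row count. The variable names below are mine; the numbers are the ones from the message.

```python
# Reproducing the quoted estimate: frequency * reltuples / 2.
min_freq = 1.00189e-05    # smallest frequency stored in the MCE array
reltuples = 350_174       # rows in the table
estimate = min_freq * reltuples / 2
print(round(estimate, 2))  # ~1.75, i.e. roughly 2 rows
```

This is why Jesper sees every out-of-array lexeme estimated at about 2 rows regardless of its true selectivity.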

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-25 Thread Jan Urbański
On 19/05/10 21:01, Jesper Krogh wrote: The document base is around 350,000 documents and I have set the statistics target on the tsvector column to 1000 since the 100 seems way off. So for tsvectors the statistics target means more or less at any time track at most 10 * target lexemes

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-25 Thread Jesper Krogh
On 26/05/2010, at 01.16, Jan Urbański wulc...@wulczer.org wrote: On 19/05/10 21:01, Jesper Krogh wrote: The document base is around 350,000 documents and I have set the statistics target on the tsvector column to 1000 since the 100 seems way off. So for tsvectors the statistics target means