Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-31 Thread Jesper Krogh
On 2010-05-31 20:38, Tom Lane wrote: Jesper Krogh writes: Just a small follow up. I tried out the patch (or actually a fresh git checkout) and it now gives very accurate results for both upper and lower end of the MCE-histogram with a lower cutoff that doesn't approach 2. Good. How

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-31 Thread Tom Lane
Jesper Krogh writes: > Just a small follow up. I tried out the patch (or actually a fresh git > checkout) and it now gives very accurate results for both upper and > lower end of the MCE-histogram with a lower cutoff that doesn't > approach 2. Good. How much did the ANALYZE time change for your

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-31 Thread Jesper Krogh
On 2010-05-30 20:02, Jan Urbański wrote: Here's a patch against recent git, but should apply to 8.4 sources as well. It would be interesting to measure the memory and time needed to analyse the table after applying it, because we will be now using a lot bigger bucket size and I haven't done any p

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-30 Thread Jan Urbański
On 31/05/10 00:07, Tom Lane wrote: > =?UTF-8?B?SmFuIFVyYmHFhHNraQ==?= writes: > I committed the attached revised version of the patch. Revisions are > mostly minor but I did make two substantive changes: > > * The patch changed the target number of mcelems from 10 * > statistics_target to just s

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-30 Thread Tom Lane
=?UTF-8?B?SmFuIFVyYmHFhHNraQ==?= writes: > Here's a patch against recent git, but should apply to 8.4 sources as > well. It would be interesting to measure the memory and time needed to > analyse the table after applying it, because we will be now using a lot > bigger bucket size and I haven't don

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-30 Thread Jan Urbański
On 30/05/10 09:08, Jesper Krogh wrote: > On 2010-05-29 15:56, Jan Urbański wrote: >> On 29/05/10 12:34, Jesper Krogh wrote: >>> I can "fairly easy" try out patches or do other kind of testing. >>> >> I'll try to come up with a patch for you to try and fiddle with these >> values before Monday

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-30 Thread Tom Lane
Jan =?UTF-8?Q?Urba=C5=84ski?= writes: >> I think the only relevance of stopwords to the current problem is that >> *if* stopwords have been removed, we would see a Zipfian distribution >> with the first few entries removed, and I'm not sure if it's still >> really Zipfian afterwards. > That's why

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-30 Thread Jan Urbański
> Jesper Krogh writes: > > On 2010-05-29 15:56, Jan Urbański wrote: > > > AFAIK statistics for everything other than tsvectors are built based > > > on the values of whole rows. > > > Wouldn't it make sense to treat array types like the tsvectors? > > Yeah, I have a personal TODO item to look in

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-30 Thread Tom Lane
Jesper Krogh writes: > On 2010-05-29 15:56, Jan Urbański wrote: >> AFAIK statistics for everything other than tsvectors are built based on >> the values of whole rows. > Wouldn't it make sense to treat array types like the tsvectors? Yeah, I have a personal TODO item to look into that in the fu

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-30 Thread Jesper Krogh
On 2010-05-29 15:56, Jan Urbański wrote: On 29/05/10 12:34, Jesper Krogh wrote: On 2010-05-28 23:47, Jan Urbański wrote: On 28/05/10 22:22, Tom Lane wrote: Now I tried to substitute some numbers there, and so assuming the English language has ~1e6 words H(W) is around 6.5. Let's assum

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-29 Thread Tom Lane
=?UTF-8?B?SmFuIFVyYmHFhHNraQ==?= writes: > [ e of ] s/2 or s/3 look reasonable. The examples in the LC paper seem to all use e = s/10. Note the stated assumption e << s. > So, should I just write a patch that sets the bucket width and pruning > count using 0.07 as the assumed frequency of the m

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-29 Thread Jan Urbański
On 29/05/10 17:34, Tom Lane wrote: > =?UTF-8?B?SmFuIFVyYmHFhHNraQ==?= writes: >> On 29/05/10 17:09, Tom Lane wrote: >>> There is definitely something wrong with your math there. It's not >>> possible for the 100'th most common word to have a frequency as high >>> as 0.06 --- the ones above it pre

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-29 Thread Tom Lane
=?UTF-8?B?SmFuIFVyYmHFhHNraQ==?= writes: > On 29/05/10 17:09, Tom Lane wrote: >> There is definitely something wrong with your math there. It's not >> possible for the 100'th most common word to have a frequency as high >> as 0.06 --- the ones above it presumably have larger frequencies, >> which

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-29 Thread Jan Urbański
On 29/05/10 17:09, Tom Lane wrote: > =?UTF-8?B?SmFuIFVyYmHFhHNraQ==?= writes: >> Now I tried to substitute some numbers there, and so assuming the >> English language has ~1e6 words H(W) is around 6.5. Let's assume the >> statistics target to be 100. > >> I chose s as 1/(st + 10)*H(W) because the

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-29 Thread Tom Lane
=?UTF-8?B?SmFuIFVyYmHFhHNraQ==?= writes: > Hm, I am now thinking that maybe this theory is flawed, because tsvecors > contain only *unique* words, and Zipf's law is talking about words in > documents in general. Normally a word like "the" would appear lots of > times in a document, but (even ignor

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-29 Thread Tom Lane
=?UTF-8?B?SmFuIFVyYmHFhHNraQ==?= writes: > Now I tried to substitute some numbers there, and so assuming the > English language has ~1e6 words H(W) is around 6.5. Let's assume the > statistics target to be 100. > I chose s as 1/(st + 10)*H(W) because the top 10 English words will most > probably

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-29 Thread Jan Urbański
On 29/05/10 12:34, Jesper Krogh wrote: > On 2010-05-28 23:47, Jan Urbański wrote: >> On 28/05/10 22:22, Tom Lane wrote: >> Now I tried to substitute some numbers there, and so assuming the >> English language has ~1e6 words H(W) is around 6.5. Let's assume the >> statistics target to be 100. >> >>

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-29 Thread Jesper Krogh
On 2010-05-28 04:47, Tom Lane wrote: Cranking up the stats target actually makes it worse not better, since low-frequency items are then more likely to get into the MCV list I should have been more precise in the wording. Cranking up the stats target gave me overall a "better plan", but tha

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-29 Thread Jesper Krogh
On 2010-05-28 23:47, Jan Urbański wrote: On 28/05/10 22:22, Tom Lane wrote: The idea that I was toying with is to assume a Zipfian distribution of the input (with some reasonable parameter), and use that to estimate what the frequency of the K'th element will be, where K is the target number

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-28 Thread Jan Urbański
On 28/05/10 22:22, Tom Lane wrote: > The idea that I was toying with is to assume a Zipfian distribution of > the input (with some reasonable parameter), and use that to estimate > what the frequency of the K'th element will be, where K is the target > number of MCV entries or perhaps a bit more.

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-28 Thread Tom Lane
=?UTF-8?B?SmFuIFVyYmHFhHNraQ==?= writes: > We follow the algorithm as written, the trouble starts when we want to > output the result. The paper says which items from the D structure > should be returned when the user asks for items that have frequencies > higher than a threshold s. What we want t

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-28 Thread Jan Urbański
On 28/05/10 04:47, Tom Lane wrote: > I re-scanned that paper and realized that there is indeed something > wrong with the way we are doing it. The paper says (last sentence in > the definition of the algorithm, section 4.2): > > When a user requests a list of items with threshold s, we outp

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-28 Thread Jan Urbański
On 28/05/10 04:47, Tom Lane wrote: > =?UTF-8?B?SmFuIFVyYmHFhHNraQ==?= writes: >> On 19/05/10 21:01, Jesper Krogh wrote: >>> In practice, just cranking the statistics estimate up high enough seems >>> to solve the problem, but doesn't >>> there seem to be something wrong in how the statistics are c

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-27 Thread Tom Lane
=?UTF-8?B?SmFuIFVyYmHFhHNraQ==?= writes: > On 19/05/10 21:01, Jesper Krogh wrote: >> In practice, just cranking the statistics estimate up high enough seems >> to solve the problem, but doesn't >> there seem to be something wrong in how the statistics are collected? > The algorithm to determine m

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-25 Thread Jesper Krogh
On 26/05/2010, at 01.16, Jan Urbański wrote: On 19/05/10 21:01, Jesper Krogh wrote: The document base is arount 350.000 documents and I have set the statistics target on the tsvector column to 1000 since the 100 seems way of. So for tsvectors the statistics target means more or less "at any

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-25 Thread Jan Urbański
On 19/05/10 21:01, Jesper Krogh wrote: > The document base is arount 350.000 documents and > I have set the statistics target on the tsvector column > to 1000 since the 100 seems way of. So for tsvectors the statistics target means more or less "at any time track at most 10 * lexemes simultaneous

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-25 Thread Alvaro Herrera
Excerpts from Jesper Krogh's message of mié may 19 15:01:18 -0400 2010: > But the distribution is very "flat" at the end, the last 128 values are > excactly > 1.00189e-05 > which means that any term sitting outside the array would get an estimate of > 1.00189e-05 * 350174 / 2 = 1.75 ~ 2 rows I d

[HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-19 Thread Jesper Krogh
Hi. I am working on getting full-text-search to work and have come across something I think look a bit strange. The document base is arount 350.000 documents and I have set the statistics target on the tsvector column to 1000 since the 100 seems way of. # ANALYZE verbose reference (document_tsv