On 2010-05-30 20:02, Jan Urbański wrote:
Here's a patch against recent git, but it should apply to 8.4 sources as
well. It would be interesting to measure the memory and time needed to
analyse the table after applying it, because we will now be using a much
bigger bucket size and I haven't done any
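(For anyone reproducing the measurement: a rough way to time ANALYZE from a
script. This is only a sketch; the connection string and table name are
placeholders, and memory use would have to be watched on the backend process
itself, e.g. with top.)

    import time
    import psycopg2   # assumes the psycopg2 driver is installed

    conn = psycopg2.connect("dbname=ftstest")   # placeholder connection string
    cur = conn.cursor()

    t0 = time.time()
    cur.execute("ANALYZE documents;")           # placeholder table holding the tsvector column
    conn.commit()
    print("ANALYZE took %.1f seconds" % (time.time() - t0))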
Jesper Krogh jes...@krogh.cc writes:
Just a small follow-up. I tried out the patch (or actually a fresh git
checkout) and it now gives very accurate results for both the upper and
lower ends of the MCE histogram, with a lower cutoff that doesn't
approach 2.
Good. How much did the ANALYZE time
On 2010-05-31 20:38, Tom Lane wrote:
Jesper Krogh jes...@krogh.cc writes:
Just a small follow-up. I tried out the patch (or actually a fresh git
checkout) and it now gives very accurate results for both the upper and
lower ends of the MCE histogram, with a lower cutoff that doesn't
approach 2.
On 2010-05-29 15:56, Jan Urbański wrote:
On 29/05/10 12:34, Jesper Krogh wrote:
On 2010-05-28 23:47, Jan Urbański wrote:
On 28/05/10 22:22, Tom Lane wrote:
Now I tried to substitute some numbers there, and so, assuming the
English language has ~1e6 words, H(W) is around 6.5. Let's
Jesper Krogh jes...@krogh.cc writes:
On 2010-05-29 15:56, Jan Urbański wrote:
AFAIK statistics for everything other than tsvectors are built based on
the values of whole rows.
Wouldn't it make sense to treat array types like the tsvectors?
Yeah, I have a personal TODO item to look into that
Jesper Krogh jes...@krogh.cc writes:
On 2010-05-29 15:56, Jan Urbański wrote:
AFAIK statistics for everything other than tsvectors are built based
on the values of whole rows.
Wouldn't it make sense to treat array types like the tsvectors?
Yeah, I have a personal TODO item to look
Jan Urbański wulc...@wulczer.org writes:
I think the only relevance of stopwords to the current problem is that
*if* stopwords have been removed, we would see a Zipfian distribution
with the first few entries removed, and I'm not sure if it's still
really Zipfian afterwards.
On 30/05/10 09:08, Jesper Krogh wrote:
On 2010-05-29 15:56, Jan Urbański wrote:
On 29/05/10 12:34, Jesper Krogh wrote:
I can fairly easily try out patches or do other kinds of testing.
I'll try to come up with a patch for you to try and fiddle with these
values before Monday.
Here's a
Jan Urbański wulc...@wulczer.org writes:
Here's a patch against recent git, but it should apply to 8.4 sources as
well. It would be interesting to measure the memory and time needed to
analyse the table after applying it, because we will now be using a much
bigger bucket size
On 31/05/10 00:07, Tom Lane wrote:
Jan Urbański wulc...@wulczer.org writes:
I committed the attached revised version of the patch. Revisions are
mostly minor but I did make two substantive changes:
* The patch changed the target number of mcelems from 10 *
On 2010-05-28 23:47, Jan Urbański wrote:
On 28/05/10 22:22, Tom Lane wrote:
The idea that I was toying with is to assume a Zipfian distribution of
the input (with some reasonable parameter), and use that to estimate
what the frequency of the K'th element will be, where K is the target
On 2010-05-28 04:47, Tom Lane wrote:
Cranking up the stats target actually makes it worse, not better, since
low-frequency items are then more likely to get into the MCV list
I should have been more precise in the wording. Cranking up the
stats target gave me a better plan overall, but that
On 29/05/10 12:34, Jesper Krogh wrote:
On 2010-05-28 23:47, Jan Urbański wrote:
On 28/05/10 22:22, Tom Lane wrote:
Now I tried to substitute some numbers there, and so, assuming the
English language has ~1e6 words, H(W) is around 6.5. Let's assume the
statistics target is 100.
I chose s as
Jan Urbański wulc...@wulczer.org writes:
Now I tried to substitute some numbers there, and so, assuming the
English language has ~1e6 words, H(W) is around 6.5. Let's assume the
statistics target is 100.
I chose s as 1/(st + 10)*H(W) because the top 10 English words will
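(Spelling out the arithmetic as written, with the thread's figures, to show
where the ~0.06 questioned below comes from; this check is not part of the
original mails.)

    H_W = 6.5                        # the thread's figure for H(W) with ~1e6 words
    st  = 100                        # statistics target
    s   = 1.0 / (st + 10) * H_W      # the expression exactly as written above
    print(s)                         # ~0.059, i.e. the ~0.06 discussed below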
Jan Urbański wulc...@wulczer.org writes:
Hm, I am now thinking that maybe this theory is flawed, because tsvectors
contain only *unique* words, and Zipf's law is talking about words in
documents in general. Normally a word like "the" would appear lots of
times in a document,
On 29/05/10 17:09, Tom Lane wrote:
Jan Urbański wulc...@wulczer.org writes:
Now I tried to substitute some numbers there, and so, assuming the
English language has ~1e6 words, H(W) is around 6.5. Let's assume the
statistics target is 100.
I chose s as 1/(st + 10)*H(W)
Jan Urbański wulc...@wulczer.org writes:
On 29/05/10 17:09, Tom Lane wrote:
There is definitely something wrong with your math there. It's not
possible for the 100'th most common word to have a frequency as high
as 0.06 --- the ones above it presumably have larger
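(A quick sanity check of that objection, not from the original mails: word
frequencies sum to at most 1, so the top 100 words cannot each have a
frequency of 0.06, and the Zipfian model itself predicts a far smaller value
for the 100th word.)

    print(100 * 0.06)         # 6.0 -- the top 100 alone would exceed a total of 1.0
    H_W = 6.5                 # the thread's figure for ~1e6 English words
    print(1.0 / (100 * H_W))  # ~0.0015, the Zipfian prediction for the 100th word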
On 29/05/10 17:34, Tom Lane wrote:
Jan Urbański wulc...@wulczer.org writes:
On 29/05/10 17:09, Tom Lane wrote:
There is definitely something wrong with your math there. It's not
possible for the 100'th most common word to have a frequency as high
as 0.06 --- the ones
Jan Urbański wulc...@wulczer.org writes:
[ e of ] s/2 or s/3 look reasonable.
The examples in the LC paper all seem to use e = s/10. Note the stated
assumption e << s.
So, should I just write a patch that sets the bucket width and pruning
count using 0.07 as the assumed
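(For context, a minimal sketch of the Lossy Counting bookkeeping under
discussion, with parameter names as in the Manku & Motwani paper: e sets the
bucket width, and entries are pruned at every bucket boundary. The PostgreSQL
implementation in ts_typanalyze.c is the authoritative version; this is only
an illustration.)

    import math

    def lossy_count(stream, e):
        """Lossy Counting bookkeeping: approximate counts with error at most e*N."""
        w = int(math.ceil(1.0 / e))    # bucket width; with e = s/10 this is 10/s
        D = {}                         # item -> [count f, max undercount delta]
        N = 0
        for item in stream:
            N += 1
            b_current = int(math.ceil(float(N) / w))
            if item in D:
                D[item][0] += 1
            else:
                D[item] = [1, b_current - 1]
            if N % w == 0:             # prune at the bucket boundary
                doomed = [k for k, (f, d) in D.items() if f + d <= b_current]
                for k in doomed:
                    del D[k]
        return D, N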
On 28/05/10 04:47, Tom Lane wrote:
Jan Urbański wulc...@wulczer.org writes:
On 19/05/10 21:01, Jesper Krogh wrote:
In practice, just cranking the statistics estimate up high enough seems
to solve the problem, but doesn't
there seem to be something wrong in how the
On 28/05/10 04:47, Tom Lane wrote:
I re-scanned that paper and realized that there is indeed something
wrong with the way we are doing it. The paper says (last sentence in
the definition of the algorithm, section 4.2):
When a user requests a list of items with threshold s, we output
Jan Urbański wulc...@wulczer.org writes:
We follow the algorithm as written; the trouble starts when we want to
output the result. The paper says which items from the D structure
should be returned when the user asks for items that have frequencies
higher than a threshold
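(The sentence Tom points to in section 4.2 is the output-time filter: after
the bookkeeping sketched earlier, only entries whose observed count clears
(s - e) * N are reported. A sketch of just that step, using the same D
structure as above:)

    def lossy_count_output(D, N, s, e):
        """Output rule from the LC paper: report items with f >= (s - e) * N."""
        return {item: f for item, (f, d) in D.items() if f >= (s - e) * N}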
On 28/05/10 22:22, Tom Lane wrote:
The idea that I was toying with is to assume a Zipfian distribution of
the input (with some reasonable parameter), and use that to estimate
what the frequency of the K'th element will be, where K is the target
number of MCV entries or perhaps a bit more.
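(A sketch of that estimator; the Zipf exponent of 1 and the H(W) value are the
thread's working assumptions, not measured values.)

    def zipf_kth_frequency(k, h_w=6.5):
        """Estimated frequency of the k'th most common word under a Zipfian
        distribution: f(k) = 1 / (k * H(W)). The thread uses H(W) ~ 6.5 for a
        vocabulary of ~1e6 English words."""
        return 1.0 / (k * h_w)

    # e.g. with a statistics target of 100 and "perhaps a bit more":
    print(zipf_kth_frequency(110))   # ~0.0014, a candidate cutoff frequency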
Jan Urbański wulc...@wulczer.org writes:
On 19/05/10 21:01, Jesper Krogh wrote:
In practice, just cranking the statistics estimate up high enough seems
to solve the problem, but doesn't
there seem to be something wrong in how the statistics are collected?
The algorithm to
Excerpts from Jesper Krogh's message of Wed May 19 15:01:18 -0400 2010:
But the distribution is very flat at the end; the last 128 values are
exactly
1.00189e-05
which means that any term sitting outside the array would get an estimate of
1.00189e-05 * 350174 / 2 = 1.75 ~ 2 rows
I don't
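(Restating the arithmetic Jesper quotes: the estimate for a term outside the
array is half the smallest frequency kept in the MCE list, scaled by the row
count. The numbers below are the ones from the mail.)

    min_mce_freq = 1.00189e-05           # smallest frequency in the MCE array
    reltuples    = 350174                # document count from the mail
    print(min_mce_freq / 2 * reltuples)  # ~1.75, i.e. an estimate of ~2 rows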
On 19/05/10 21:01, Jesper Krogh wrote:
The document base is around 350,000 documents and
I have set the statistics target on the tsvector column
to 1000 since 100 seems way off.
So for tsvectors the statistics target more or less means: at any time,
track at most 10 * target lexemes
On 26/05/2010, at 01.16, Jan Urbański wulc...@wulczer.org wrote:
On 19/05/10 21:01, Jesper Krogh wrote:
The document base is around 350,000 documents and
I have set the statistics target on the tsvector column
to 1000 since 100 seems way off.
So for tsvectors the statistics target means