Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-31 Thread Jesper Krogh

On 2010-05-30 20:02, Jan Urbański wrote:

Here's a patch against recent git, but should apply to 8.4 sources as
well. It would be interesting to measure the memory and time needed to
analyse the table after applying it, because we will be now using a lot
bigger bucket size and I haven't done any performance impact testing on
it. I updated the initial comment block in compute_tsvector_stats, but
the prose could probably be improved.
   

Just a small follow up. I tried out the patch (or actually a fresh git
checkout) and it now gives very accurate results for both upper and
lower end of the MCE-histogram with a lower cutoff that doesn't
approach 2.

Thanks a lot.

--
Jesper



Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-31 Thread Tom Lane
Jesper Krogh jes...@krogh.cc writes:
 Just a small follow up. I tried out the patch (or actually a fresh git
 checkout) and it now gives very accurate results for both upper and
 lower end of the MCE-histogram with a lower cutoff that doesn't
 approach 2.

Good.  How much did the ANALYZE time change for your table?

regards, tom lane



Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-31 Thread Jesper Krogh

On 2010-05-31 20:38, Tom Lane wrote:

Jesper Krogh jes...@krogh.cc writes:
   

Just a small follow up. I tried out the patch (or actually a fresh git
checkout) and it now gives very accurate results for both upper and
lower end of the MCE-histogram with a lower cutoff that doesn't
approach 2.
 

Good.  How much did the ANALYZE time change for your table?
   

1.3m documents.

New code (3 runs):
statistics target 1000 = 155s/124s/110s
statistics target 100 = 86s/55s/61s
Old code:
statistics target 1000 = 158s/101s/99s
statistics target 100 = 90s/29s/33s

Somehow I think that the first run is the relevant one; it's pretty much
a dead-disk test, and I wouldn't expect that random sampling of tuples
would have any sane caching effect in a production system. But it looks
like the algorithm is a bit slower.


Thanks again..

--
Jesper



Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-30 Thread Jesper Krogh

On 2010-05-29 15:56, Jan Urbański wrote:

On 29/05/10 12:34, Jesper Krogh wrote:
   

On 2010-05-28 23:47, Jan Urbański wrote:
 

On 28/05/10 22:22, Tom Lane wrote:
Now I tried to substitute some numbers there, and so assuming the
English language has ~1e6 words H(W) is around 6.5. Let's assume the
statistics target to be 100.

I chose s as 1/(st + 10)*H(W) because the top 10 English words will most
probably be stopwords, so we will never see them in the input.

   

I think you should skip the assumption about stop-words; users may
use something where they are needed in the index or have a language
other than the typical one (and they don't seem to influence the math that much).
 

Turns out it has nearly linear influence on the bucket width and the
frequency necessary to survive the final pruning. I put some data in a
spreadsheet, results below.

   

How about setting it to some default in the first analyze round, but
setting it to the count of MCE's with a frequency of 1 in the subsequent
analyze rounds?


Isn't it the same type of logic that is used for collecting statistics
for array-types, say integer-arrays and text arrays?
 

AFAIK statistics for everything other than tsvectors are built based on
the values of whole rows. ts_typanalyze is the only typanalyze function
that takes the trouble of looping over the actual contents of each cell,
all the others just compare whole arrays (which means that for a text[]
field you will probably get a quite useless MCV entry).
   


In another area, I was thinking about modelling a complete tree structure
where I would like to extract complete sub-branches as int[] of the node-ids
in the set and then indexing them using gin. That seems like a really bad idea
based on the above information.

Wouldn't it make sense to treat array types like the tsvectors?


The results are attached in a text (CSV) file, to preserve formatting.
Based on them I'd like to propose top_stopwords and error_factor to be 100.
   


I know it is not perceived as the correct way to do things, but I would
really like to keep the stop words in the dataset and have
something that is robust to that.

There are 2 issues behind that wish; one is that the application
becomes more general. I really cannot stop my users from searching
for stop-words, and they would expect the full set and not the empty set as
we get now.

The list of stop words is by no means finite and would very
much depend on the input data set.

I would try to add the stop-words to the dictionary, so they still work but
don't occupy that much space in the actual index. That seems to
solve the same task but with fewer issues for the users and more generalized
code around it.


I can fairly easy try out patches or do other kind of testing.
 

I'll try to come up with a patch for you to try and fiddle with these
values before Monday.
   


Excellent.


testdb=# explain select id from testdb.reference where document_tsvector
@@ plainto_tsquery('where') order by id limit 50;
NOTICE:  text-search query contains only stop words or doesn't contain
lexemes, ignored

                                 QUERY PLAN
------------------------------------------------------------------------------------
 Limit  (cost=41.02..41.03 rows=1 width=4)
   ->  Sort  (cost=41.02..41.03 rows=1 width=4)
         Sort Key: id
         ->  Bitmap Heap Scan on reference  (cost=34.50..41.01 rows=1 width=4)
               Recheck Cond: (document_tsvector @@ plainto_tsquery('where'::text))
               ->  Bitmap Index Scan on reference_fts_idx  (cost=0.00..34.50 rows=1 width=0)
                     Index Cond: (document_tsvector @@ plainto_tsquery('where'::text))
(7 rows)

testdb=# select id from testdb.reference where document_tsvector @@
plainto_tsquery('where') order by id limit 50;
NOTICE:  text-search query contains only stop words or doesn't contain
lexemes, ignored
NOTICE:  text-search query contains only stop words or doesn't contain
lexemes, ignored

 id
----
(0 rows)

testdb=#

I would indeed have expected the first 50 rows ordered by id..  trivial 
to extract.


--
Jesper



Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-30 Thread Tom Lane
Jesper Krogh jes...@krogh.cc writes:
 On 2010-05-29 15:56, Jan Urbański wrote:
 AFAIK statistics for everything other than tsvectors are built based on
 the values of whole rows.

 Wouldn't it make sense to treat array types like the tsvectors?

Yeah, I have a personal TODO item to look into that in the future.

 The results are attached in a text (CSV) file, to preserve formatting.
 Based on them I'd like to propose top_stopwords and error_factor to be 100.

 I know it is not perceived as the correct way to do things, but I would
 really like to keep the stop words in the dataset and have
 something that is robust to that.

Any stop words would already have been eliminated in the transformation
to tsvector (or not, if none were configured in the dictionary setup).
We should not assume that there are any in what ts_typanalyze is seeing.

I think the only relevance of stopwords to the current problem is that
*if* stopwords have been removed, we would see a Zipfian distribution
with the first few entries removed, and I'm not sure if it's still
really Zipfian afterwards.  However, we only need the assumption of
Zipfianness to compute a target frequency cutoff, so it's not like
things will be completely broken if the distribution isn't quite
Zipfian.

regards, tom lane



Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-30 Thread Jan Urbański
 Jesper Krogh jes...@krogh.cc writes:
  On 2010-05-29 15:56, Jan Urbański wrote:
   AFAIK statistics for everything other than tsvectors are built based
   on the values of whole rows.
 
  Wouldn't it make sense to treat array types like the tsvectors?
 
 Yeah, I have a personal TODO item to look into that in the future.

There were plans to generalise the functions in ts_typanalyze and use LC for 
array types as well. If one day I'd find myself with a lot of free time I'd 
take a stab at that.

   The results are attached in a text (CSV) file, to preserve
   formatting. Based on them I'd like to propose top_stopwords and
   error_factor to be 100.
 
  I know it is not perceived as the correct way to do things, but I would
  really like to keep the stop words in the dataset and have
  something that is robust to that.
 
 Any stop words would already have been eliminated in the transformation
 to tsvector (or not, if none were configured in the dictionary setup).
 We should not assume that there are any in what ts_typanalyze is seeing.

Yes, and as a side note, if you want to be indexing stopwords, just don't pass 
a stopword file when creating the text search dictionary (or pass a custom one).

 
 I think the only relevance of stopwords to the current problem is that
 *if* stopwords have been removed, we would see a Zipfian distribution
 with the first few entries removed, and I'm not sure if it's still
 really Zipfian afterwards.    However, we only need the assumption of
 Zipfianness to compute a target frequency cutoff, so it's not like
 things will be completely broken if the distribution isn't quite
 Zipfian.

That's why I was proposing to take s = 0.07 / (MCE-count + 10). But that 
probably doesn't matter much.

Jan



Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-30 Thread Tom Lane
Jan Urbański wulc...@wulczer.org writes:
 I think the only relevance of stopwords to the current problem is that
 *if* stopwords have been removed, we would see a Zipfian distribution
 with the first few entries removed, and I'm not sure if it's still
 really Zipfian afterwards.

 That's why I was proposing to take s = 0.07 / (MCE-count + 10). But that 
 probably doesn't matter much.

Oh, now I get the point of that.  Yeah, it is probably a good idea.
If the input doesn't have stopwords removed, the worst that will happen
is we'll collect stats for an extra 10 or so lexemes, which will then
get thrown away when they don't fit into the MCE list.  +1.

regards, tom lane



Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-30 Thread Jan Urbański
On 30/05/10 09:08, Jesper Krogh wrote:
 On 2010-05-29 15:56, Jan Urbański wrote:
 On 29/05/10 12:34, Jesper Krogh wrote:
 I can fairly easy try out patches or do other kind of testing.
  
 I'll try to come up with a patch for you to try and fiddle with these
 values before Monday.

Here's a patch against recent git, but should apply to 8.4 sources as
well. It would be interesting to measure the memory and time needed to
analyse the table after applying it, because we will be now using a lot
bigger bucket size and I haven't done any performance impact testing on
it. I updated the initial comment block in compute_tsvector_stats, but
the prose could probably be improved.

 testdb=# explain select id from testdb.reference where document_tsvector
 @@ plainto_tsquery('where') order by id limit 50;
 NOTICE:  text-search query contains only stop words or doesn't contain
 lexemes, ignored

That's orthogonal to the issue with the statistics collection, you just
need to modify your stopwords list (for instance make it empty).

Cheers,
Jan
*** src/backend/tsearch/ts_typanalyze.c
--- src/backend/tsearch/ts_typanalyze.c	2010-05-30 19:52:28.0 +0200
***
*** 92,112 
   *	http://www.vldb.org/conf/2002/S10P03.pdf
   *
   *	The Lossy Counting (aka LC) algorithm goes like this:
!  *	Let D be a set of triples (e, f, d), where e is an element value, f is
!  *	that element's frequency (occurrence count) and d is the maximum error in
!  *	f.	We start with D empty and process the elements in batches of size
!  *	w. (The batch size is also known as bucket size.) Let the current batch
!  *	number be b_current, starting with 1. For each element e we either
!  *	increment its f count, if it's already in D, or insert a new triple into D
!  *	with values (e, 1, b_current - 1). After processing each batch we prune D,
!  *	by removing from it all elements with f + d <= b_current. Finally, we
!  *	gather elements with largest f.  The LC paper proves error bounds on f
!  *	dependent on the batch size w, and shows that the required table size
!  *	is no more than a few times w.
   *
!  *	We use a hashtable for the D structure and a bucket width of
!  *	statistics_target * 10, where 10 is an arbitrarily chosen constant,
!  *	meant to approximate the number of lexemes in a single tsvector.
   */
  static void
  compute_tsvector_stats(VacAttrStats *stats,
--- 92,126 
   *	http://www.vldb.org/conf/2002/S10P03.pdf
   *
   *	The Lossy Counting (aka LC) algorithm goes like this:
! 
!  *  Let s be a threshold frequency for an item and epsilon the error margin for
!  *  the frequency. Let D be a set of triples (e, f, d), where e is an element
!  *  value, f is that element's frequency (occurrence count) and d is the maximum
!  *  error in f. We start with D empty and process the elements in batches of
!  *  size w. (The batch size is also known as bucket size and is equal to
!  *  1/epsilon.) Let the current batch number be b_current, starting with 1. For
!  *  each element e we either increment its f count, if it's already in D, or
!  *  insert a new triple into D with values (e, 1, b_current - 1). After
!  *  processing each batch we prune D, by removing from it all elements with f +
!  *  d <= b_current.  After the algorithm finishes we suppress all elements from
!  *  D that do not satisfy f >= (s - e) * N, where N is the total number of
!  *  lexemes in the input.  The LC paper proves error bounds on f dependent on
!  *  the batch size w, and shows that the required table size is no more than a
!  *  few times w.
!  *
!  *  We set s to be the estimated frequency of the K'th word in a natural
!  *  language's frequency table, where K is the target number of entries in the
!  *  MCELEM array plus an arbitrary constant, meant to reflect the fact that the
!  *  top words in any language would usually be stopwords and we will not ever
!  *  see them in the input.  We assume that the distribution of word frequencies
!  *  follows Zipf's law with an exponent of 1.
   *
!  *  Assuming Zipfian distribution, thet frequency of the K'th element is equal
!  *  to 1/(K * H(W)) where H(n) is 1/2 + 1/3 + ... + 1/n and W is the number of
!  *  words in the language.  Putting one million as W, we get roughly 0.07/K,
!  *  assuming top 10 words are stopwords gives s = 0.07/(K + 10).  We set epsilon
!  *  to s/10 = 0.007/(K + 10), which gives a bucket width of (K + 10)/0.007 and
!  *  use a hashtable for the D structure.
   */
  static void
  compute_tsvector_stats(VacAttrStats *stats,
***
*** 114,120 
  	   int samplerows,
  	   double totalrows)
  {
! 	int			num_mcelem;
  	int			null_cnt = 0;
  	double		total_width = 0;
  
--- 128,134 
  	   int samplerows,
  	   double totalrows)
  {
! 	int			num_mcelem = stats->attr->attstattarget;
  	int			null_cnt = 0;
  	double		total_width = 0;
  
***
*** 133,147 
  	LexemeHashKey hash_key;
  	TrackItem  *item;
  
- 	/* We want statistics_target 

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-30 Thread Tom Lane
Jan Urbański wulc...@wulczer.org writes:
 Here's a patch against recent git, but should apply to 8.4 sources as
 well. It would be interesting to measure the memory and time needed to
 analyse the table after applying it, because we will be now using a lot
 bigger bucket size and I haven't done any performance impact testing on
 it.

I did a little bit of testing using a dataset I had handy (a couple
hundred thousand publication titles) and found that ANALYZE seems to be
noticeably but far from intolerably slower --- it's almost the same
speed at statistics targets up to 100, and even at the max setting of
10000 it's only maybe 25% slower.  However I'm not sure if this result
will scale to very large document sets, so more testing would be a good
idea.

I committed the attached revised version of the patch.  Revisions are
mostly minor but I did make two substantive changes:

* The patch changed the target number of mcelems from 10 *
statistics_target to just statistics_target.  I reverted that since
I don't think it was intended; at least we hadn't discussed it.

* I modified the final processing to avoid one qsort step if there are
fewer than num_mcelems hashtable entries that pass the cutoff frequency
filter, and in any case to sort only those entries that pass it rather
than all of them.  With the significantly larger number of hashtable
entries that will now be used, it seemed like a good thing to try to
cut the qsort overhead.
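
For illustration only, here is a tiny standalone sketch of that final-processing
idea (hypothetical names and toy data; this is not the committed ts_typanalyze.c
code): filter the tracked items by the cutoff count first, and qsort only the
survivors, and only when more of them passed than there are MCELEM slots to fill.

/*
 * filter_then_sort.c -- illustrative sketch only; struct, names and toy
 * data are made up, not the committed ts_typanalyze.c code.
 * Compile with: cc filter_then_sort.c
 */
#include <stdio.h>
#include <stdlib.h>

typedef struct
{
    const char *lexeme;
    int         count;
} Item;

/* sort pointers to items by descending count */
static int
compare_count_desc(const void *a, const void *b)
{
    const Item *ia = *(const Item *const *) a;
    const Item *ib = *(const Item *const *) b;

    return ib->count - ia->count;
}

int
main(void)
{
    Item        track[] = {
        {"postgres", 900}, {"gin", 40}, {"tsvector", 250},
        {"stats", 3}, {"lexeme", 120}, {"analyze", 75}
    };
    int         track_len = sizeof(track) / sizeof(track[0]);
    int         cutoff = 50;        /* stands in for (s - e) * N */
    int         num_mcelem = 3;     /* MCELEM slots to fill */
    const Item *survivors[16];
    int         nsurv = 0;

    /* keep only items whose count passes the cutoff frequency filter */
    for (int i = 0; i < track_len; i++)
        if (track[i].count >= cutoff)
            survivors[nsurv++] = &track[i];

    /* sort (and truncate) only if more items passed than we can keep */
    if (nsurv > num_mcelem)
    {
        qsort(survivors, nsurv, sizeof(survivors[0]), compare_count_desc);
        nsurv = num_mcelem;
    }

    for (int i = 0; i < nsurv; i++)
        printf("%s: %d\n", survivors[i]->lexeme, survivors[i]->count);
    return 0;
}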

regards, tom lane

Index: ts_typanalyze.c
===
RCS file: /cvsroot/pgsql/src/backend/tsearch/ts_typanalyze.c,v
retrieving revision 1.8
diff -c -r1.8 ts_typanalyze.c
*** ts_typanalyze.c	2 Jan 2010 16:57:53 -	1.8
--- ts_typanalyze.c	30 May 2010 21:42:30 -
***
*** 92,112 
   *	http://www.vldb.org/conf/2002/S10P03.pdf
   *
   *	The Lossy Counting (aka LC) algorithm goes like this:
!  *	Let D be a set of triples (e, f, d), where e is an element value, f is
!  *	that element's frequency (occurrence count) and d is the maximum error in
!  *	f.	We start with D empty and process the elements in batches of size
!  *	w. (The batch size is also known as bucket size.) Let the current batch
!  *	number be b_current, starting with 1. For each element e we either
!  *	increment its f count, if it's already in D, or insert a new triple into D
!  *	with values (e, 1, b_current - 1). After processing each batch we prune D,
!  *	by removing from it all elements with f + d <= b_current. Finally, we
!  *	gather elements with largest f.  The LC paper proves error bounds on f
!  *	dependent on the batch size w, and shows that the required table size
!  *	is no more than a few times w.
   *
!  *	We use a hashtable for the D structure and a bucket width of
!  *	statistics_target * 10, where 10 is an arbitrarily chosen constant,
!  *	meant to approximate the number of lexemes in a single tsvector.
   */
  static void
  compute_tsvector_stats(VacAttrStats *stats,
--- 92,140 
   *	http://www.vldb.org/conf/2002/S10P03.pdf
   *
   *	The Lossy Counting (aka LC) algorithm goes like this:
!  *	Let s be the threshold frequency for an item (the minimum frequency we
!  *	are interested in) and epsilon the error margin for the frequency. Let D
!  *	be a set of triples (e, f, delta), where e is an element value, f is that
!  *	element's frequency (actually, its current occurrence count) and delta is
!  *	the maximum error in f. We start with D empty and process the elements in
!  *	batches of size w. (The batch size is also known as bucket size and is
!  *	equal to 1/epsilon.) Let the current batch number be b_current, starting
!  *	with 1. For each element e we either increment its f count, if it's
!  *	already in D, or insert a new triple into D with values (e, 1, b_current
!  *	- 1). After processing each batch we prune D, by removing from it all
!  *	elements with f + delta <= b_current.  After the algorithm finishes we
!  *	suppress all elements from D that do not satisfy f >= (s - epsilon) * N,
!  *	where N is the total number of elements in the input.  We emit the
!  *	remaining elements with estimated frequency f/N.  The LC paper proves
!  *	that this algorithm finds all elements with true frequency at least s,
!  *	and that no frequency is overestimated or is underestimated by more than
!  *	epsilon.  Furthermore, given reasonable assumptions about the input
!  *	distribution, the required table size is no more than about 7 times w.
   *
!  *	We set s to be the estimated frequency of the K'th word in a natural
!  *	language's frequency table, where K is the target number of entries in
!  *	the MCELEM array plus an arbitrary constant, meant to reflect the fact
!  *	that the most common words in any language would usually be stopwords
!  *	so we will not actually see them in the input.  We assume that the
!  *	distribution of word frequencies (including the stopwords) follows 

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-30 Thread Jan Urbański
On 31/05/10 00:07, Tom Lane wrote:
 Jan Urbański wulc...@wulczer.org writes:
 I committed the attached revised version of the patch.  Revisions are
 mostly minor but I did make two substantive changes:
 
 * The patch changed the target number of mcelems from 10 *
 statistics_target to just statistics_target.  I reverted that since
 I don't think it was intended; at least we hadn't discussed it.

Yeah, that was accidental.

 * I modified the final processing to avoid one qsort step if there are
 fewer than num_mcelems hashtable entries that pass the cutoff frequency
 filter, and in any case to sort only those entries that pass it rather
 than all of them.  With the significantly larger number of hashtable
 entries that will now be used, it seemed like a good thing to try to
 cut the qsort overhead.

Makes sense.

Thanks,
Jan



Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-29 Thread Jesper Krogh

On 2010-05-28 23:47, Jan Urbański wrote:

On 28/05/10 22:22, Tom Lane wrote:
   

The idea that I was toying with is to assume a Zipfian distribution of
the input (with some reasonable parameter), and use that to estimate
what the frequency of the K'th element will be, where K is the target
number of MCV entries or perhaps a bit more.  Then use that estimate as
the s value, and set e = s/10 or so, and then w = 1/e and continue as
per the paper.  If the eventual filtering results in a lot less than the
target number of MCV entries (because the input wasn't so Zipfian), we
lose, but at least we have accurate numbers for the entries we kept.
 

I see what you mean, so the idea would be:

  * assume some value of W as the number of all words in the language
  * estimate s as 1/(st + 10)*H(W), where H(W) is the W'th harmonic
number and st is the statistics target, using Zipf's law
  * set e = s/10 and w = 1/e, that is 10/s
  * perform LC using that value of w
  * remove all elements for which f < (s-e)N, that is f < 0.9*sN, where N
is the total number of lexemes processed
  * create the MCELEM entries as (item, f/N)

Now I tried to substitute some numbers there, and so assuming the
English language has ~1e6 words H(W) is around 6.5. Let's assume the
statistics target to be 100.

I chose s as 1/(st + 10)*H(W) because the top 10 English words will most
probably be stopwords, so we will never see them in the input.
   


I think you should skip the assumption about stop-words; users may
use something where they are needed in the index or have a language
other than the typical one (and they don't seem to influence the math that much).

Isn't it the same type of logic that is used for collecting statistics for
array-types, say integer-arrays and text arrays?

Using the above estimate s ends up being 6.5/(100 + 10) = 0.06

We then do LC, pruning the D structure every w = 1/0.006 = 167 lexemes
   


I'm not sure I get this one.. does this mean that we prune every time
we have collected 167 new datapoints? That would seem too often
to me since that would roughly be once per row.


After that, we remove lexemes with f < 0.9 * 0.06 * N = 0.054*N

So assuming that on average a tsvector has 154 elements and that we went
through 35017 rows (as it would be in Jesper's case, before he raised
the stats target from 100 to 1000), we will remove lexemes with f <
0.054 * 35017 * 154, that is f < 291201.37

I wonder what would happen in Jesper's case if we did that... And I
wonder how sound that maths is
   


If it means that I would get an accurate MCE-histogram for all
things that have an occurrence of more than 5.4% of the rows
(given the samples chosen), then I think that would be really
reasonable.

I can fairly easily try out patches or do other kinds of testing.

--
Jesper



Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-29 Thread Jesper Krogh

On 2010-05-28 04:47, Tom Lane wrote:

Cranking up the stats target actually makes it worse not better, since
low-frequency items are then more likely to get into the MCV list
   


I should have been more precise in the wording. Cranking up the
stats target gave me overall a better plan, but that is because
the range in the MCE histogram where the query plan for my sample
query tipped from a Bitmap Index Scan on the gin index to an
Index Scan on a btree index actually became reliable.

This is more due to the nature of my application and test queries
than to the correctness of the MCE histogram.

So cranking up the statistics target made the problem move
to somewhere where it didn't matter that much to me.

--
Jesper



Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-29 Thread Jan Urbański
On 29/05/10 12:34, Jesper Krogh wrote:
 On 2010-05-28 23:47, Jan Urbański wrote:
 On 28/05/10 22:22, Tom Lane wrote:
 Now I tried to substitute some numbers there, and so assuming the
 English language has ~1e6 words H(W) is around 6.5. Let's assume the
 statistics target to be 100.

 I chose s as 1/(st + 10)*H(W) because the top 10 English words will most
 probably be stopwords, so we will never see them in the input.

 I think you should skip the assumption about stop-words; users may
 use something where they are needed in the index or have a language
 other than the typical one (and they don't seem to influence the math that much).

Turns out it has nearly linear influence on the bucket width and the
frequency necessary to survive the final pruning. I put some data in a
spreadsheet, results below.

 Isn't it the same type of logic that is used for collecting statistics
 for
 array-types, say integer-arrays and text arrays?

AFAIK statistics for everything other than tsvectors are built based on
the values of whole rows. ts_typanalyze is the only typanalyze function
that takes the trouble of looping over the actual contents of each cell,
all the others just compare whole arrays (which means that for a text[]
field you will probably get a quite useless MCV entry).

 Using the above estimate s ends up being 6.5/(100 + 10) = 0.06

 We then do LC, pruning the D structure every w = 1/0.006 = 167 lexemes

 
 I'm not sure I get this one.. does this mean that we prune every time
 we have collected 167 new datapoints? That would seem too often
 to me since that would roughly be once per row.

Hm, if we pick s to be 0.06, we say that the K'th word in the English
language will have a frequency of 0.06, so if we want to have statistics
with an error of s/10, we can prune every 167 lexemes (K is the
statistics target, possibly +top_stopwords).

Hm, I am now thinking that maybe this theory is flawed, because tsvectors
contain only *unique* words, and Zipf's law is talking about words in
documents in general. Normally a word like "the" would appear lots of
times in a document, but (even ignoring the fact that it's a stopword
and so won't appear at all) in a tsvector it will be present only once.
This may or may not be a problem, not sure if such squashing of
occurrences as tsvectors do skews the distribution away from Zipfian or not.

Anyway, figuring that out would require some more math and thinking, and
to fix the problem at hand we can say Zipf is good enough.

 After that, we remove lexemes with f < 0.9 * 0.06 * N = 0.054*N

 So assuming that on average a tsvector has 154 elements and that we went
 through 35017 rows (as it would be in Jesper's case, before he raised
 the stats target from 100 to 1000), we will remove lexemes with f <
 0.054 * 35017 * 154, that is f < 291201.37

 I wonder what would happen in Jesper's case if we did that... And I
 wonder how sound that maths is

 
 If it means that I would get an accurate MCE-histogram for all
 things that have an occurrence of more than 5.4% of the rows
 (given the samples chosen), then I think that would be really
 reasonable.

Here's the spreadsheet spat out.

The variables are:
 * the statistics target
 * top stopwords
 * error factor

Where top stopwords is the number of top words in the English language
that would be stopwords. You can also think about it as the smudge
factor determining how much we trust that the distribution is Zipfian.
Theoretically if you want to keep X values in the MCE array, you should
discard inputs with frequency lower than the frequency of the X'th value
in a Zipfian distribution. If you would write out all English words and
their frequencies (according to Zipf's law), the top Y of them would be
stopwords. We want to discard words with frequency lower than that of the
(X + Y)'th word, and then we probably want to have some breathing space as well. That
cutoff frequency is called s in the LC algorithm.

Error factor determines the relation between s and e, since apparently
we want e to be proportional to s (e is the error from the LC
algorithm). It directly determines the bucket width, since the larger
the bucket, the more accurate the results will be, as there will be less
pruning going on.

There are also constants: H(len(eng)) is the harmonic number from Zipf's
law, which, assuming 1e6 words in English, is 6.5. tsvector length and rows
in sample are just some values to get concrete numbers out. They
influence the final pruning frequency, because the rule is f >= (s-e)N
and N is the total number of lexemes seen.

The results are attached in a text (CSV) file, to preserve formatting.
Based on them I'd like to propose top_stopwords and error_factor to be 100.

With your dataset this would mean pruning every 3076 lexemes and
discarding from the result all lexemes with < 173507 occurrences. With
statistics target set to 1000 it would change to 16923 and 31546,
respectively.

 I can fairly easy try out patches or do other kind of testing.

I'll try to come up with a patch for you to try and fiddle with these
values before Monday.

Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-29 Thread Tom Lane
Jan Urbański wulc...@wulczer.org writes:
 Now I tried to substitute some numbers there, and so assuming the
 English language has ~1e6 words H(W) is around 6.5. Let's assume the
 statistics target to be 100.

 I chose s as 1/(st + 10)*H(W) because the top 10 English words will most
 probably be stopwords, so we will never see them in the input.

 Using the above estimate s ends up being 6.5/(100 + 10) = 0.06

There is definitely something wrong with your math there.  It's not
possible for the 100'th most common word to have a frequency as high
as 0.06 --- the ones above it presumably have larger frequencies,
which makes the total quite a lot more than 1.0.

For the purposes here, I think it's probably unnecessary to use the more
complex statements of Zipf's law.  The interesting property is the rule
the k'th most common element occurs 1/k as often as the most common one.
So if you suppose the most common lexeme has frequency 0.1, the 100'th
most common should have frequency around 0.0001.  That's pretty crude
of course but it seems like the right ballpark.

regards, tom lane



Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-29 Thread Tom Lane
Jan Urbański wulc...@wulczer.org writes:
 Hm, I am now thinking that maybe this theory is flawed, because tsvectors
 contain only *unique* words, and Zipf's law is talking about words in
 documents in general. Normally a word like "the" would appear lots of
 times in a document, but (even ignoring the fact that it's a stopword
 and so won't appear at all) in a tsvector it will be present only once.
 This may or may not be a problem, not sure if such squashing of
 occurrences as tsvectors do skews the distribution away from Zipfian or not.

Well, it's still going to approach Zipfian distribution over a large
number of documents.  In any case we are not really depending on Zipf's
law heavily with this approach.  The worst-case result if it's wrong
is that we end up with an MCE list shorter than our original target.
I suggest we could try this and see if we notice that happening a lot.

regards, tom lane



Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-29 Thread Jan Urbański
On 29/05/10 17:09, Tom Lane wrote:
 Jan Urbański wulc...@wulczer.org writes:
 Now I tried to substitute some numbers there, and so assuming the
 English language has ~1e6 words H(W) is around 6.5. Let's assume the
 statistics target to be 100.
 
 I chose s as 1/(st + 10)*H(W) because the top 10 English words will most
 probably be stopwords, so we will never see them in the input.
 
 Using the above estimate s ends up being 6.5/(100 + 10) = 0.06
 
 There is definitely something wrong with your math there.  It's not
 possible for the 100'th most common word to have a frequency as high
 as 0.06 --- the ones above it presumably have larger frequencies,
 which makes the total quite a lot more than 1.0.

Upf... hahaha, I computed this as 1/(st + 10)*H(W), where it should be
1/((st + 10)*H(W))... So s would be 1/(110*6.5) = 0.0014

With regards to my other mail this means that top_stopwords = 10 and
error_factor = 10 would mean bucket_width = 7150 and final prune value
of 6787.

Jan



Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-29 Thread Tom Lane
Jan Urbański wulc...@wulczer.org writes:
 On 29/05/10 17:09, Tom Lane wrote:
 There is definitely something wrong with your math there.  It's not
 possible for the 100'th most common word to have a frequency as high
 as 0.06 --- the ones above it presumably have larger frequencies,
 which makes the total quite a lot more than 1.0.

 Upf... hahaha, I computed this as 1/(st + 10)*H(W), where it should be
 1/((st + 10)*H(W))... So s would be 1/(110*6.5) = 0.0014

Um, apparently I can't do simple arithmetic first thing in the morning
either, cause I got my number wrong too ;-)

After a bit more research: if you use the basic form of Zipf's law
with a 1/k distribution, the first frequency has to be about 0.07
to make the total come out to 1.0 for a reasonable number of words.
So we could use s = 0.07 / K when we wanted a final list of K words.
Some people (including the LC paper) prefer a higher exponent, ie
1/k^S with S around 1.25.  That makes the F1 value around 0.22 which
seems awfully high for the type of data we're working with, so I think
the 1/k rule is probably what we want here.
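
As a rough check of where those two constants come from (assuming a
vocabulary of about 10^6 words; the figures below are approximate, not
taken from the thread):

    F_1 = \frac{1}{\sum_{k=1}^{W} k^{-1}} = \frac{1}{H(W)} \approx \frac{1}{14.4} \approx 0.07
        \qquad (\text{exponent } 1)

    F_1 = \frac{1}{\sum_{k=1}^{W} k^{-1.25}} \approx \frac{1}{\zeta(1.25)} \approx \frac{1}{4.6} \approx 0.22
        \qquad (\text{exponent } 1.25)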

regards, tom lane



Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-29 Thread Jan Urbański
On 29/05/10 17:34, Tom Lane wrote:
 Jan Urbański wulc...@wulczer.org writes:
 On 29/05/10 17:09, Tom Lane wrote:
 There is definitely something wrong with your math there.  It's not
 possible for the 100'th most common word to have a frequency as high
 as 0.06 --- the ones above it presumably have larger frequencies,
 which makes the total quite a lot more than 1.0.
 
 Upf... hahaha, I computed this as 1/(st + 10)*H(W), where it should be
 1/((st + 10)*H(W))... So s would be 1/(110*6.5) = 0.0014
 
 Um, apparently I can't do simple arithmetic first thing in the morning
 either, cause I got my number wrong too ;-)
 
 After a bit more research: if you use the basic form of Zipf's law
 with a 1/k distribution, the first frequency has to be about 0.07
 to make the total come out to 1.0 for a reasonable number of words.
 So we could use s = 0.07 / K when we wanted a final list of K words.
 Some people (including the LC paper) prefer a higher exponent, ie
 1/k^S with S around 1.25.  That makes the F1 value around 0.22 which
 seems awfully high for the type of data we're working with, so I think
 the 1/k rule is probably what we want here.

OK, I think we're getting somewhere :o)

I took the formula from Wikipedia's page on Zipf's law, assuming an
exponent of 1:

freq(K) = 1 / (K * H(W)), where H(x) = 1 + 1/2 + 1/3 + ... + 1/x and W is
the number of words in English

Then I took the nth harmonic number expansion from the page on harmonic
numbers:

H(n) = ln(n) + 0.5772156649 + 1/(2n) - 1/(12n^2) + 1/(120n^4) + O(n^-6)

Assuming 1 million words in English and the remaining terms of the harmonic
expansion to be negligible, we get H(1e6) = 14.3927, which would make the
frequency of the K'th word 1/(14.3927 * K), that is 0.06948 / K (let's say
0.07 / K).

Which brings me to the same result as yours, which in turn reassures me
a lot ;) My previous result was wrong because I used the wrong logarithm
base, go figure.

So with this, for statistics target of 100 we would predict the
frequency of the 100th word to be 0.0007. Assuming 154*35017 lexemes in
the input the bucket width and the final pruning value depend only on
the epsilon that we choose for the LC algorithm.

So, if we want e to be equal to s, we'd prune every 1/s = 1/0.0007 =
1428 lexemes and would not discard anything from the result. If we want
e to be s/2 we'd prune every 2857 lexemes and discard lexemes with
counts < 1887. For s/3, s/4 etc the numbers look like this:

  e      prune every    discard counts below
  s/1        1428             0
  s/2        2857          1887
  s/3        4285          2516
  s/4        5714          2831
  s/5        7142          3019
  s/6        8571          3145
  s/7       10000          3235
  s/8       11428          3302
  s/9       12857          3355

s/2 or s/3 look reasonable.
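
For example, the s/2 row follows directly from those definitions (taking
s = 0.0007 and N = 154 * 35017 = 5392618 lexemes as above; numbers rounded):

    e = \frac{s}{2} = 0.00035, \qquad w = \frac{1}{e} \approx 2857, \qquad
    (s - e)\,N \approx 0.00035 \times 5392618 \approx 1887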

So, should I just write a patch that sets the bucket width and pruning
count using 0.07 as the assumed frequency of the most common word and
epsilon equal to s/2 or s/3?

Cheers,
Jan



Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-29 Thread Tom Lane
Jan Urbański wulc...@wulczer.org writes:
 [ e of ] s/2 or s/3 look reasonable.

The examples in the LC paper seem to all use e = s/10.  Note the stated
assumption e << s.

 So, should I just write a patch that sets the bucket width and pruning
 count using 0.07 as the assumed frequency of the most common word and
 epsilon equal to s/2 or s/3?

I'd go with s = 0.07 / desired-MCE-count and e = s / 10, at least for
a first cut to experiment with.
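
For a statistics target of 100 and the sample size used elsewhere in the
thread (N = 154 * 35017, roughly 5.4e6 lexemes), those settings would work
out roughly to:

    s = \frac{0.07}{100} = 7 \times 10^{-4}, \qquad
    e = \frac{s}{10} = 7 \times 10^{-5}, \qquad
    w = \frac{1}{e} \approx 14286, \qquad
    (s - e)\,N \approx 3397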

regards, tom lane



Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-28 Thread Jan Urbański
On 28/05/10 04:47, Tom Lane wrote:
 Jan Urbański wulc...@wulczer.org writes:
 On 19/05/10 21:01, Jesper Krogh wrote:
 In practice, just cranking the statistics estimate up high enough seems
 to solve the problem, but doesn't
 there seem to be something wrong in how the statistics are collected?
 
 The algorithm to determine most common vals does not do it accurately.
 That would require keeping all lexemes from the analysed tsvectors in
 memory, which would be impractical. If you want to learn more about the
 algorithm being used, try reading
 http://www.vldb.org/conf/2002/S10P03.pdf and corresponding comments in
 ts_typanalyze.c
 
 I re-scanned that paper and realized that there is indeed something
 wrong with the way we are doing it.

 So I think we have to fix this. 

Hm, I'll try to take another look this evening (CEST).

Cheers,
Jan



Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-28 Thread Jan Urbański
On 28/05/10 04:47, Tom Lane wrote:
 I re-scanned that paper and realized that there is indeed something
 wrong with the way we are doing it.  The paper says (last sentence in
 the definition of the algorithm, section 4.2):
 
   When a user requests a list of items with threshold s, we output
   those entries in D where f >= (s-e)N.
 
 What we are actually doing is emitting every entry with f >= 2.  Since
 e is fixed at 1/w, this effectively means that we are setting s to be
 only fractionally greater than e, which means very large relative errors
 in the estimates.

I gave it a though and reread the paper, but since I already blundered
once, please verify me on this.

We follow the algorithm as written, the trouble starts when we want to
output the result. The paper says which items from the D structure
should be returned when the user asks for items that have frequencies
higher than a threshold s. What we want to put in the statistics table
are items accompanied by their frequencies, so we need to do some
reasoning before we can construct the result.

Say we have an item I with count f (taken from our D structure). The
total number of entries is N. The question would be: what would be the
maximum frequency threshold that the user could specify that would still
make us output this element. From

f >= (s - e) * N

we can say it's

s <= (f / N) + e

So the element would be output if the user asks for items that occur with
frequency (f / N) + e or less. This would mean that the corresponding value
in the statistics entry should be <I, (f / N) + e>.

The thing is, this doesn't change much, because currently we are putting
(f / N) there, and e is set to 1 / (stats_target * 10), so the difference
would not be dramatic.

 Or, if you want it explained another way: we are emitting words whose f
 is very small and whose delta is very large, representing items that got
 added to the scan very late.  These really shouldn't be there because
 their true frequency is probably a lot less than the intended threshold.

Well, the idea of the algorithm is that if their frequency would have
been bigger, they would appear earlier and would survive the pruning, as
I understand it.

 The net effect of this is first that there are a lot of rather useless
 entries in the MCV list whose claimed frequency is quite small, like as
 little as two occurrences.  Their true frequency could be quite a bit
 more.  What's even worse is that we believe that the minimum of these
 claimed frequencies is a reliable upper bound for the frequencies of
 items not listed in the MCV list.

Per the algorithm it *is* the upper bound, if I got my maths correctly.
The last item in the list would not be returned if the requested
frequency was higher than the one that is associated to that item.

 So I think we have to fix this.  The right thing is to select s and e
 values that are actually meaningful in the terms of the paper, then
 compute w from that, and be sure to enforce the minimum f value
 specified in the algorithm ... ie, don't be afraid to throw away values
 in the final D table.

I think we should definitely prune the table one last time in the very
probable case that the loop ended between two of the regularly
happening prunes.

As for selecting the algorithm parameters: we don't get to select s. We
do get to select e, but that's it. I have a feeling that our problems
are caused by the fact that the algorithm tries to answer the question
"which elements occur more frequently than X" and we actually want the
answer to "what's the frequency of each element". I've almost
convinced myself that the transformation between the answers to these
questions exists, but this trouble report keeps giving me doubts.

Cheers,
Jan



Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-28 Thread Tom Lane
Jan Urbański wulc...@wulczer.org writes:
 We follow the algorithm as written, the trouble starts when we want to
 output the result. The paper says which items from the D structure
 should be returned when the user asks for items that have frequencies
 higher than a threshold s. What we want to put in the statistics table
 are items accompanied by their frequencies, so we need to do some
 reasoning before we can construct the result.

Well, the estimated frequency is still just f/N.  The point is that we
must filter out items with small f values because they're probably
inaccurate --- in particular, anything with f < eN is completely
untrustworthy.

I agree that we currently aren't bothering to determine a specific s
value, but we probably need to do that in order to have a clear
understanding of what we are getting.

The idea that I was toying with is to assume a Zipfian distribution of
the input (with some reasonable parameter), and use that to estimate
what the frequency of the K'th element will be, where K is the target
number of MCV entries or perhaps a bit more.  Then use that estimate as
the s value, and set e = s/10 or so, and then w = 1/e and continue as
per the paper.  If the eventual filtering results in a lot less than the
target number of MCV entries (because the input wasn't so Zipfian), we
lose, but at least we have accurate numbers for the entries we kept.
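
Written out as formulas (with K the target number of MCV entries, W the
assumed vocabulary size, H the harmonic number and N the number of lexemes
seen), that recipe is:

    s \approx \frac{1}{K \cdot H(W)}, \qquad e = \frac{s}{10}, \qquad
    w = \frac{1}{e} = 10\,K\,H(W),

and after the LC pass only entries with f >= (s - e)N are kept.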

 I think we should definitely prune the table one last time in the very
 probable case that the loop ended between two of the regularly
 happening prunes.

The paper doesn't say that you need to do that.  I suspect if you work
through the math, you'll find out that the minimum-f filter skips
anything that would have been pruned anyway.

regards, tom lane



Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-28 Thread Jan Urbański
On 28/05/10 22:22, Tom Lane wrote:
 The idea that I was toying with is to assume a Zipfian distribution of
 the input (with some reasonable parameter), and use that to estimate
 what the frequency of the K'th element will be, where K is the target
 number of MCV entries or perhaps a bit more.  Then use that estimate as
 the s value, and set e = s/10 or so, and then w = 1/e and continue as
 per the paper.  If the eventual filtering results in a lot less than the
 target number of MCV entries (because the input wasn't so Zipfian), we
 lose, but at least we have accurate numbers for the entries we kept.

I see what you mean, so the idea would be:

 * assume some value of W as the number of all words in the language
 * estimate s as 1/(st + 10)*H(W), where H(W) is the W'th harmonic
number and st is the statistics target, using Zipf's law
 * set e = s/10 and w = 1/e, that is 10/s
 * perform LC using that value of w
 * remove all elements for which f < (s-e)N, that is f < 0.9*sN, where N
is the total number of lexemes processed
 * create the MCELEM entries as (item, f/N)
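
As an illustration of those steps only, here is a minimal self-contained
sketch (toy input, hypothetical names, and a linear-scan D instead of a
hashtable; this is not the ts_typanalyze.c code):

/*
 * lc_sketch.c -- toy Lossy Counting sketch for the procedure above.
 * Compile with: cc lc_sketch.c -lm
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>

typedef struct
{
    const char *lexeme;     /* element value e */
    long        f;          /* occurrence count */
    long        delta;      /* maximum error in f (b_current - 1 at insert) */
} Entry;

int
main(void)
{
    /* toy input stream of lexemes (made up) */
    const char *stream[] = {
        "postgres", "gin", "postgres", "tsvector", "postgres",
        "gin", "stats", "postgres", "gin", "tsvector", "postgres", "gin"
    };
    long        N = sizeof(stream) / sizeof(stream[0]);

    /* s = 0.07 / K (Zipfian guess for the K'th word), e = s/10, w = 1/e */
    int         K = 3;
    double      s = 0.07 / K;
    double      eps = s / 10.0;
    long        w = (long) ceil(1.0 / eps);

    Entry       D[64];          /* plenty for this toy input */
    int         nD = 0;
    long        b_current = 1;

    for (long i = 0; i < N; i++)
    {
        int         j;

        /* increment f if the element is already in D, else insert it */
        for (j = 0; j < nD; j++)
            if (strcmp(D[j].lexeme, stream[i]) == 0)
                break;
        if (j < nD)
            D[j].f++;
        else
        {
            D[nD].lexeme = stream[i];
            D[nD].f = 1;
            D[nD].delta = b_current - 1;
            nD++;
        }

        /* at the end of each bucket, prune entries with f + delta <= b_current */
        if ((i + 1) % w == 0)
        {
            int         k = 0;

            for (j = 0; j < nD; j++)
                if (D[j].f + D[j].delta > b_current)
                    D[k++] = D[j];
            nD = k;
            b_current++;
        }
    }

    /* final filter: keep only entries with f >= (s - e) * N, emit f/N */
    for (int j = 0; j < nD; j++)
        if (D[j].f >= (s - eps) * N)
            printf("%-10s estimated frequency %.3f\n",
                   D[j].lexeme, (double) D[j].f / N);
    return 0;
}

With such a tiny stream the bucket never fills, so no pruning happens and the
final filter keeps everything; on a real lexeme stream the prune step is what
bounds the memory use.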

Now I tried to substitute some numbers there, and so assuming the
English language has ~1e6 words H(W) is around 6.5. Let's assume the
statistics target to be 100.

I chose s as 1/(st + 10)*H(W) because the top 10 English words will most
probably be stopwords, so we will never see them in the input.

Using the above estimate s ends up being 6.5/(100 + 10) = 0.06

We then do LC, pruning the D structure every w = 1/0.006 = 167 lexemes

After that, we remove lexemes with f < 0.9 * 0.06 * N = 0.054*N

So assuming that on average a tsvector has 154 elements and that we went
through 35017 rows (as it would be in Jesper's case, before he raised
the stats target from 100 to 1000), we will remove lexemes with f <
0.054 * 35017 * 154, that is f < 291201.37

I wonder what would happen in Jesper's case if we did that... And I
wonder how sound that maths is.

 I think we should definitely prune the table one last time in the very
 probable case that the loop ended between two of the regularly
 happening prunes.
 
 The paper doesn't say that you need to do that.  I suspect if you work
 through the math, you'll find out that the minimum-f filter skips
 anything that would have been pruned anyway.

Possible.

Cheers,
Jan



Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-27 Thread Tom Lane
Jan Urbański wulc...@wulczer.org writes:
 On 19/05/10 21:01, Jesper Krogh wrote:
 In practice, just cranking the statistics estimate up high enough seems
 to solve the problem, but doesn't
 there seem to be something wrong in how the statistics are collected?

 The algorithm to determine most common vals does not do it accurately.
 That would require keeping all lexemes from the analysed tsvectors in
 memory, which would be impractical. If you want to learn more about the
 algorithm being used, try reading
 http://www.vldb.org/conf/2002/S10P03.pdf and corresponding comments in
 ts_typanalyze.c

I re-scanned that paper and realized that there is indeed something
wrong with the way we are doing it.  The paper says (last sentence in
the definition of the algorithm, section 4.2):

When a user requests a list of items with threshold s, we output
    those entries in D where f >= (s-e)N.

What we are actually doing is emitting every entry with f >= 2.  Since
e is fixed at 1/w, this effectively means that we are setting s to be
only fractionally greater than e, which means very large relative errors
in the estimates.
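
To put a rough number on that: with the current bucket width of
w = statistics_target * 10 and the sample sizes mentioned elsewhere in the
thread (statistics target 100, roughly N = 154 * 35017, about 5.4e6 lexemes
seen), we have

    e = \frac{1}{w} = \frac{1}{1000}, \qquad e\,N \approx 5393,

so an entry that survives with f = 2 may have a true count anywhere between
2 and roughly 2 + 5393.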

Or, if you want it explained another way: we are emitting words whose f
is very small and whose delta is very large, representing items that got
added to the scan very late.  These really shouldn't be there because
their true frequency is probably a lot less than the intended threshold.

The net effect of this is first that there are a lot of rather useless
entries in the MCV list whose claimed frequency is quite small, like as
little as two occurrences.  Their true frequency could be quite a bit
more.  What's even worse is that we believe that the minimum of these
claimed frequencies is a reliable upper bound for the frequencies of
items not listed in the MCV list.  Both of these errors are manifest
in Jesper's description of his case, and they're also easy to reproduce
if you just take stats on any reasonable-size collection of documents.
Cranking up the stats target actually makes it worse not better, since
low-frequency items are then more likely to get into the MCV list.

So I think we have to fix this.  The right thing is to select s and e
values that are actually meaningful in the terms of the paper, then
compute w from that, and be sure to enforce the minimum f value
specified in the algorithm ... ie, don't be afraid to throw away values
in the final D table.

regards, tom lane



Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-25 Thread Alvaro Herrera
Excerpts from Jesper Krogh's message of Wed May 19 15:01:18 -0400 2010:

 But the distribution is very flat at the end, the last 128 values are 
 exactly
 1.00189e-05
 which means that any term sitting outside the array would get an estimate of
 1.00189e-05 * 350174 / 2 = 1.75 ~ 2 rows

I don't know if this is related, but tsvector stats are computed and
stored per term, not per datum.  This is different from all other
datatypes.  Maybe there's code somewhere that's assuming per-datum and
coming up with the wrong estimates?  Or maybe the tsvector-specific code
contains a bug somewhere; maybe a rounding error?

-- 
Álvaro Herrera alvhe...@alvh.no-ip.org



Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-25 Thread Jan Urbański
On 19/05/10 21:01, Jesper Krogh wrote:
 The document base is arount 350.000 documents and
 I have set the statistics target on the tsvector column
 to 1000 since the 100 seems way off.

So for tsvectors the statistics target means more or less "at any time,
track at most 10 * target lexemes simultaneously", where "track" means
keeping them in memory while going through the tuples being analysed.

Remember that the measure is in lexemes, not whole tsvectors and the 10
factor is meant to approximate the average number of unique lexemes in a
tsvector. If your documents are very large, this might not be a good
approximation.

 # ANALYZE verbose reference (document_tsvector);
 INFO:  analyzing reference
 INFO:  reference: scanned 14486 of 14486 pages, containing 350174 live
 rows and 6027 dead rows; 300000 rows in sample, 350174 estimated total rows
 ANALYZE
 
 Ok, so analyze almost examined all rows. Looking into
 most_common_freqs I find
 # select count(unnest) from (select unnest(most_common_freqs) from
 pg_stats where attname = 'document_tsvector') as foo;
  count
 ---
   2810
 (1 row)

So the size of the most_common_freqs and most_common_vals rows in
pg_statistics for tsvectors has an upper bound of stats-target * 10
(for the same reasons as mentioned before) and holds lexemes (not whole
tsvectors). What happens also is that lexemes that were seen only once
while going through the analysed set are discarded, so that's why you
can actually get fewer entries in these arrays, even if your document set
is big.


 But the distribution is very flat at the end, the last 128 values are
 exactly
 1.00189e-05
 which means that any term sitting outside the array would get an
 estimate of
 1.00189e-05 * 350174 / 2 = 1.75 ~ 2 rows

Yeah, this might mean that you could try cranking up the stats target a
lot, to make the set of simultaneously tracked lexemes larger (it will
cost time and memory during analyse though). If the documents have
completely different contents, what can happen is that almost all
lexemes are only seen a few times and get removed during the pruning of
the working set. I have seen similar behaviour while working on the
typanalyze function for tsvectors.

 So far I have no idea if this is bad or good, so a couple of sample runs
 of stuff that
 is sitting outside the most_common_vals array:
 
 [gathered statistics suck]

 So the most_common_vals seems to contain a lot of values that should
 never have been kept in favor
 of other values that are more common.

 In practice, just cranking the statistics estimate up high enough seems
 to solve the problem, but doesn't
 there seem to be something wrong in how the statistics are collected?

The algorithm to determine most common vals does not do it accurately.
That would require keeping all lexemes from the analysed tsvectors in
memory, which would be impractical. If you want to learn more about the
algorithm being used, try reading
http://www.vldb.org/conf/2002/S10P03.pdf and corresponding comments in
ts_typanalyze.c

It would be interesting to know what's the average size of a tsvector in
your document set (ie. how many unique lexemes does a tsvector have on
average). In general, the tsvector typanalyze function is designed to
suck less than the constant factor that has been used previously, but it
only works really well on the most common lexemes (thus preventing most
gross misestimates). I'm not very surprised it misses the difference
between 1612/350174 and 4/350174 and I'm quite happy that it gets that
if you set the stats target really high :o)

There's always the possibility that there's some stupid bug there, but I
think you just set your expectations too high for the tsvector typanalyze
function. If you could come up with a better way of doing tsvector
stats, that would be awesome - currently it's just doing its best to
prevent the most outrageous errors.

Cheers,
Jan



Re: [HACKERS] tsvector pg_stats seems quite a bit off.

2010-05-25 Thread Jesper Krogh


On 26/05/2010, at 01.16, Jan Urbański wulc...@wulczer.org wrote:


On 19/05/10 21:01, Jesper Krogh wrote:

The document base is arount 350.000 documents and
I have set the statistics target on the tsvector column
to 1000 since the 100 seems way of.


So for tsvectors the statistics target means more or less "at any time,
track at most 10 * target lexemes simultaneously", where "track" means
keeping them in memory while going through the tuples being analysed.

Remember that the measure is in lexemes, not whole tsvectors and the 10
factor is meant to approximate the average number of unique lexemes in a
tsvector. If your documents are very large, this might not be a good
approximation.


I just did an avg(length(document_tsvector)), which is 154.
Doc count is 1.3m now in my sample set.


But the distribution is very flat at the end, the last 128 values are
exactly 1.00189e-05
which means that any term sitting outside the array would get an
estimate of 1.00189e-05 * 350174 / 2 = 1.75 ~ 2 rows


Yeah, this might mean that you could try cranking up the stats target a
lot, to make the set of simultaneously tracked lexemes larger (it will
cost time and memory during analyse though). If the documents have
completely different contents, what can happen is that almost all
lexemes are only seen a few times and get removed during the pruning of
the working set. I have seen similar behaviour while working on the
typanalyze function for tsvectors.


I think I would prefer something less magic. I can increase the
statistics target and get more reliable data, but that also increases
the amount of tuples being picked out for analysis, which is really
time consuming.


But that also means that what gets stored as the lower bound of the
histogram isn't anywhere near the real lower bound, but more like the lower
bound of the artificial histogram that happened after the last pruning.


I would suggest that the pruning at the end should be aware of this,
perhaps by keeping track of the least frequent value that never got
pruned and using that as the last pruning and lower bound?


Thanks a lot for the explanation; it fits fairly well with why I couldn't
construct a simple test set that had the problem.




So far I have no idea if this is bad or good, so a couple of sample runs
of stuff that is sitting outside the most_common_vals array:

[gathered statistics suck]

So the most_common_vals seems to contain a lot of values that should
never have been kept in favor of other values that are more common.


In practice, just cranking the statistics estimate up high enough seems
to solve the problem, but doesn't
there seem to be something wrong in how the statistics are collected?


The algorithm to determine most common vals does not do it accurately.
That would require keeping all lexemes from the analysed tsvectors in
memory, which would be impractical. If you want to learn more about the
algorithm being used, try reading
http://www.vldb.org/conf/2002/S10P03.pdf and corresponding comments in
ts_typanalyze.c


I'll do some reading.

Jesper