Re: [HACKERS] gsoc, oprrest function for text search take 2

2008-09-19 Thread ju219721

Quoting Tom Lane [EMAIL PROTECTED]:


I wrote:

...  One possibly
performance-relevant point is to use DatumGetTextPP for detoasting;
you've already paid the costs by using VARDATA_ANY etc, so you might
as well get the benefit.


Actually, wait a second.  That code doesn't work at all on toasted data,
because it's trying to use VARSIZE_ANY_EXHDR() before detoasting.
That would give you the physical datum size (eg the size of the toast
pointer), not the number you need.

However, this is actually not a problem because we know that the data
came from an array in pg_statistic, which means the individual members
*can't be toasted*.  At least they can't be compressed or out-of-line.
We'd do that at the array level, it's not sensible to do it on an
individual array member.

I think that right at the moment the array stuff doesn't permit short
headers either, but it would make sense to relax that someday.  So I'd
recommend that your code allow either regular or short headers, but not
worry about compression or out-of-line storage.

Which boils down to: keep using VARSIZE_ANY_EXHDR/VARDATA_ANY, but
forget the detoasting step.  Maybe put in
Assert(!VARATT_IS_COMPRESSED(datum)  !VARATT_IS_EXTERNAL(datum))
instead.


Tom, Heikki, everyone,

just to let you know - I haven't forgot about this, but right now I'm  
porting myself to another country, so things are quite hectic ;)


I'll pick up the patch in a couple of weeks, so I'm sure it will be  
ready for the November CF.


Thanks,
Jan


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gsoc, oprrest function for text search take 2

2008-09-19 Thread Jan Urbański

[EMAIL PROTECTED] wrote:

Quoting Tom Lane [EMAIL PROTECTED]:


I wrote:

...  One possibly
performance-relevant point is to use DatumGetTextPP for detoasting;
you've already paid the costs by using VARDATA_ANY etc, so you might
as well get the benefit.


Actually, wait a second.  That code doesn't work at all on toasted data,
because it's trying to use VARSIZE_ANY_EXHDR() before detoasting.
That would give you the physical datum size (eg the size of the toast
pointer), not the number you need.

However, this is actually not a problem because we know that the data
came from an array in pg_statistic, which means the individual members
*can't be toasted*.  At least they can't be compressed or out-of-line.
We'd do that at the array level, it's not sensible to do it on an
individual array member.

I think that right at the moment the array stuff doesn't permit short
headers either, but it would make sense to relax that someday.  So I'd
recommend that your code allow either regular or short headers, but not
worry about compression or out-of-line storage.

Which boils down to: keep using VARSIZE_ANY_EXHDR/VARDATA_ANY, but
forget the detoasting step.  Maybe put in
Assert(!VARATT_IS_COMPRESSED(datum)  !VARATT_IS_EXTERNAL(datum))
instead.


Well whaddya know. It turned out that my new company has a 
'Fridays-are-for-any-opensource-hacking-you-like' policy, so I got a 
full day to work on the patch.
Attached is a version that stores the minimal and maximal frequencies in 
the Numbers array, has the aforementioned assertion and more nicely 
ordered functions in ts_selfuncs.c.


I tested it with oprofile and
pgbench -n -f tssel-bench.sql -t 1000 postgres
with tssel-bench.sql containing
select * from manuals where tsvector @@ to_tsquery('foo');

manuals has ~700 rows and 'foo' does not appear in any of the lexemes.

The results are:
=== CVS HEAD ===
scaling factor: 1
query mode: simple
number of clients: 1
number of transactions per client: 1000
number of transactions actually processed: 1000/1000
tps = 13.399584 (including connections establishing)
tps = 13.399972 (excluding connections establishing)

7406934.7779  pglz_decompress
3856018.1052  tsvectorout
7688  3.6098  pg_mblen
5366  2.5195  hash_search_with_hash_value
4833  2.2693  pg_utf_mblen
4718  2.2153  AllocSetAlloc
4041  1.8974  index_getnext
3100  1.4556  LWLockAcquire
3056  1.4349  hash_any
2843  1.3349  LWLockRelease
2611  1.2260  AllocSetFree
2126  0.9982  tsCompareString
2121  0.9959  _bt_compare
1830  0.8592  LockAcquire
1517  0.7123  toast_fetch_datum
1503  0.7057  .plt
1338  0.6282  _bt_checkkeys
1332  0.6254  FunctionCall2
1233  0.5789  ReadBuffer_common
1185  0.5564  slot_deform_tuple
1157  0.5433  TParserGet
1123  0.5273  LockRelease


=== PATCH ===
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 1
number of transactions per client: 1000
number of transactions actually processed: 1000/1000
tps = 13.309346 (including connections establishing)
tps = 13.309761 (excluding connections establishing)

171514   35.0802  pglz_decompress
8723117.8416  tsvectorout
17107 3.4989  pg_mblen
12514 2.5595  hash_search_with_hash_value
11124 2.2752  pg_utf_mblen
10739 2.1965  AllocSetAlloc
8534  1.7455  index_getnext
7460  1.5258  LWLockAcquire
6876  1.4064  LWLockRelease
6622  1.3544  hash_any
5773  1.1808  AllocSetFree
5210  1.0656  _bt_compare
4849  0.9918  tsCompareString
4043  0.8269  LockAcquire
3535  0.7230  .plt
3246  0.6639  _bt_checkkeys
3170  0.6484  toast_fetch_datum
3057  0.6253  FunctionCall2
2815  0.5758  ReadBuffer_common
2767  0.5659  TParserGet
2605  0.5328  slot_deform_tuple
2567  0.5250  MemoryContextAlloc

Cheers,
Jan

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin
*** a/doc/src/sgml/catalogs.sgml
--- b/doc/src/sgml/catalogs.sgml
***
*** 6664,6669 
--- 6664,6671 
 A list of the frequencies of the most common values or elements,
 i.e., number of occurrences of each divided by total number of rows.
 (NULL when structfieldmost_common_vals/structfield is.)
+For some datatypes such as typetsvector/, it can also store some
+additional information, i.e. be longer than the most_common_vals array.
/entry
   /row
  
*** a/src/backend/tsearch/Makefile
--- b/src/backend/tsearch/Makefile
***
*** 19,25  DICTFILES=synonym_sample.syn thesaurus_sample.ths 
hunspell_sample.affix \
  OBJS = ts_locale.o ts_parse.o wparser.o wparser_def.o dict.o \
dict_simple.o dict_synonym.o dict_thesaurus.o \
dict_ispell.o regis.o spell.o \
!   to_tsany.o ts_typanalyze.o ts_utils.o
  
  include $(top_srcdir)/src/backend/common.mk
  
--- 19,25 
  OBJS = ts_locale.o ts_parse.o wparser.o wparser_def.o dict.o \
dict_simple.o dict_synonym.o 

Re: [HACKERS] gsoc, oprrest function for text search take 2

2008-09-19 Thread Tom Lane
=?UTF-8?B?SmFuIFVyYmHFhHNraQ==?= [EMAIL PROTECTED] writes:
 [EMAIL PROTECTED] wrote:
 Well whaddya know. It turned out that my new company has a 
 'Fridays-are-for-any-opensource-hacking-you-like' policy, so I got a 
 full day to work on the patch.

Hm, does their name start with G?

 Attached is a version that stores the minimal and maximal frequencies in 
 the Numbers array, has the aforementioned assertion and more nicely 
 ordered functions in ts_selfuncs.c.

Excellent, I'll get to work on this version.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gsoc, oprrest function for text search take 2

2008-09-19 Thread Jan Urbański

Tom Lane wrote:

=?UTF-8?B?SmFuIFVyYmHFhHNraQ==?= [EMAIL PROTECTED] writes:

[EMAIL PROTECTED] wrote:
Well whaddya know. It turned out that my new company has a 
'Fridays-are-for-any-opensource-hacking-you-like' policy, so I got a 
full day to work on the patch.


Hm, does their name start with G?


No ;) It's called Flumotion (http://www.flumotion.com/eng/).

Attached is a version that stores the minimal and maximal frequencies in 
the Numbers array, has the aforementioned assertion and more nicely 
ordered functions in ts_selfuncs.c.


Excellent, I'll get to work on this version.


Great, thanks.

Jan

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gsoc, oprrest function for text search take 2

2008-09-19 Thread Tom Lane
=?UTF-8?B?SmFuIFVyYmHFhHNraQ==?= [EMAIL PROTECTED] writes:
 Attached is a version that stores the minimal and maximal frequencies in 
 the Numbers array, has the aforementioned assertion and more nicely 
 ordered functions in ts_selfuncs.c.

Applied with some small corrections.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gsoc, oprrest function for text search take 2

2008-09-02 Thread Tom Lane
=?UTF-8?B?SmFuIFVyYmHFhHNraQ==?= [EMAIL PROTECTED] writes:
 Pre-sorting introduced one problem (see XXX in code): it's not easy 
 anymore to get the minimal frequency of MCELEM values. I was using it to 
 assert that the selectivity of a tsquery node containing a lexeme not in 
 MCELEM is no more that min(MCELEM freqs) / 2. That's only significant 
 when the minimum frequency is less than DEFAULT_TS_SEL * 2, so I'm kind 
 of inclined to ignore it and maybe drop a comment in the code that this 
 may be a potential problem.

This is easily fixed: there is nothing saying that a pg_statistic slot's
contents must contain the same numbers of Values and Numbers.  Make the
numbers array have one extra element and store the min frequency there.
Maybe it'd be worth having 2 extra elements and dropping the max in,
as well.  I don't immediately have a use for it, but it'll be a lot
harder to add it later if we don't put it in now.

 If nothing is fundamentally broken with this, I'll repeat my profiling 
 tests to see if anything has been gained.

I don't have much except minor stylistic gripes (like the ordering of
the functions in ts_selfuncs.c seeming a bit random).  One possibly
performance-relevant point is to use DatumGetTextPP for detoasting;
you've already paid the costs by using VARDATA_ANY etc, so you might
as well get the benefit.

Please fix the above and do the performance testing ...

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gsoc, oprrest function for text search take 2

2008-09-02 Thread Tom Lane
I wrote:
 ...  One possibly
 performance-relevant point is to use DatumGetTextPP for detoasting;
 you've already paid the costs by using VARDATA_ANY etc, so you might
 as well get the benefit.

Actually, wait a second.  That code doesn't work at all on toasted data,
because it's trying to use VARSIZE_ANY_EXHDR() before detoasting.
That would give you the physical datum size (eg the size of the toast
pointer), not the number you need.

However, this is actually not a problem because we know that the data
came from an array in pg_statistic, which means the individual members
*can't be toasted*.  At least they can't be compressed or out-of-line.
We'd do that at the array level, it's not sensible to do it on an
individual array member.

I think that right at the moment the array stuff doesn't permit short
headers either, but it would make sense to relax that someday.  So I'd
recommend that your code allow either regular or short headers, but not
worry about compression or out-of-line storage.

Which boils down to: keep using VARSIZE_ANY_EXHDR/VARDATA_ANY, but
forget the detoasting step.  Maybe put in
Assert(!VARATT_IS_COMPRESSED(datum)  !VARATT_IS_EXTERNAL(datum))
instead.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gsoc, oprrest function for text search take 2

2008-08-27 Thread Jan Urbański

Tom Lane wrote:

=?UTF-8?B?SmFuIFVyYmHFhHNraQ==?= [EMAIL PROTECTED] writes:

Simon Riggs wrote:

put it in a file called selfuncs_ts.c so it is similar to the existing
filename?



I followed the pattern of ts_parse.c, ts_utils.c and so on.
Also, I see geo_selfuncs.c. No big deal, though, I can move it.


Given the precedent of geo_selfuncs.c, I think you were right the
first time.  A more interesting question is whether it should just
get folded into selfuncs.c ...


selfuncs.c is a 5.8k lines beast, I felt a bit intimidated when first 
opened it. The code in ts_selfuncs.c relies strongly on what the code in 
ts_typanalyze.c does and that was another reason for putting in in its 
own file next to ts_typanalyze.c. I don't really care to be honest, 
might as well stick it into selfuncs.c.


--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gsoc, oprrest function for text search take 2

2008-08-27 Thread Heikki Linnakangas

Jan Urbański wrote:

Tom Lane wrote:
=?UTF-8?B?SmFuIFVyYmHFhHNraQ==?= [EMAIL PROTECTED] 
writes:

Simon Riggs wrote:

put it in a file called selfuncs_ts.c so it is similar to the existing
filename?



I followed the pattern of ts_parse.c, ts_utils.c and so on.
Also, I see geo_selfuncs.c. No big deal, though, I can move it.


Given the precedent of geo_selfuncs.c, I think you were right the
first time.  A more interesting question is whether it should just
get folded into selfuncs.c ...


selfuncs.c is a 5.8k lines beast, I felt a bit intimidated when first 
opened it. The code in ts_selfuncs.c relies strongly on what the code in 
ts_typanalyze.c does and that was another reason for putting in in its 
own file next to ts_typanalyze.c. I don't really care to be honest, 
might as well stick it into selfuncs.c.


I would leave the code in ts_selfuncs.c like you did. The stuff in 
selfuncs.c is pretty generic, not related to any specific data type. 
With the exception of the regex and LIKE selectivity functions, but 
those should rather be moved to a separate file, say pattern_selfuncs.c. 
There's also plenty of datatype-specific code in convert_to_scalar() and 
its subroutines, but even the comment there says that it's a hack.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gsoc, oprrest function for text search take 2

2008-08-27 Thread Simon Riggs

On Tue, 2008-08-26 at 12:45 +0200, Jan Urbański wrote:

  put it in a file called selfuncs_ts.c so it is similar to the existing
  filename?
 
 I followed the pattern of ts_parse.c, ts_utils.c and so on.
 Also, I see geo_selfuncs.c. No big deal, though, I can move it.

No don't worry. You're right, the horse has already bolted.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gsoc, oprrest function for text search take 2

2008-08-26 Thread Simon Riggs

On Thu, 2008-08-14 at 22:27 +0200, Jan Urbański wrote:
 Jan Urbański wrote:

 +  * ts_selfuncs.c

Not sure why this is in its own file, but if it must be could we please
put it in a file called selfuncs_ts.c so it is similar to the existing
filename?

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gsoc, oprrest function for text search take 2

2008-08-26 Thread Jan Urbański

Simon Riggs wrote:

On Thu, 2008-08-14 at 22:27 +0200, Jan Urbański wrote:

Jan Urbański wrote:



+  * ts_selfuncs.c


Not sure why this is in its own file


I couldn't decide where to put it, so I came up with this.


put it in a file called selfuncs_ts.c so it is similar to the existing
filename?


I followed the pattern of ts_parse.c, ts_utils.c and so on.
Also, I see geo_selfuncs.c. No big deal, though, I can move it.

Cheers,
Jan

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gsoc, oprrest function for text search take 2

2008-08-26 Thread Tom Lane
=?UTF-8?B?SmFuIFVyYmHFhHNraQ==?= [EMAIL PROTECTED] writes:
 Simon Riggs wrote:
 put it in a file called selfuncs_ts.c so it is similar to the existing
 filename?

 I followed the pattern of ts_parse.c, ts_utils.c and so on.
 Also, I see geo_selfuncs.c. No big deal, though, I can move it.

Given the precedent of geo_selfuncs.c, I think you were right the
first time.  A more interesting question is whether it should just
get folded into selfuncs.c ...

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gsoc, oprrest function for text search take 2

2008-08-14 Thread Jan Urbański

Heikki Linnakangas wrote:

Jan Urbański wrote:

26763 3.5451  AllocSetCheck


Make sure you disable assertions before profiling.


Awww, darn. OK, here goes another set of results, without casserts this 
time.


=== CVS HEAD ===

number of clients: 10
number of transactions per client: 10
number of transactions actually processed: 100/100
tps = 6437.286494 (including connections establishing)
tps = 6438.168927 (excluding connections establishing)

samples  %symbol name
220443   11.6613  AllocSetAlloc
79355 4.1978  base_yyparse
77230 4.0854  SearchCatCache
56011 2.9629  hash_search_with_hash_value
45946 2.4305  MemoryContextAllocZeroAligned
38577 2.0407  hash_any
36414 1.9263  MemoryContextAlloc
33060 1.7489  AllocSetFree
27218 1.4398  ScanKeywordLookup
25793 1.3644  base_yylex
20579 1.0886  hash_uint32
18867 0.9981  hash_seq_search
18293 0.9677  expression_tree_walker
17696 0.9361  copyObject
16979 0.8982  LockAcquire
14292 0.7560  MemoryContextAllocZero
13117 0.6939  SearchSysCache

=== ts_sel 

number of clients: 10
number of transactions per client: 10
number of transactions actually processed: 100/100
tps = 3216.753677 (including connections establishing)
tps = 3216.996592 (excluding connections establishing)

942096   10.9130  internal_text_pattern_compare
8091959.3735  bttext_pattern_cmp
6595457.6400  pg_detoast_datum_packed
6281147.2759  pg_qsort
6039986.9966  AllocSetAlloc
5818806.7403  pglz_decompress
4677085.4178  DirectFunctionCall2
3858544.4696  compare_two_textfreqs
1605781.8601  AllocSetFree
1286421.4902  swapfunc
1128851.3076  MemoryContextAlloc
1033881.1976  SearchCatCache
1003871.1629  text_to_cstring
99004 1.1468  hash_search_with_hash_value
98444 1.1403  .plt
92664 1.0734  base_yyparse
88511 1.0253  errstart

Not good... Shall I try sorting pg_statistics arrays on text values 
instead of frequencies?
BTW: I just noticed some text_to_cstring calls, they came from 
elog(DEBUG1)s that I have in my code. But they couldn't have skewn the 
results much, could they?


Cheers,
Jan

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gsoc, oprrest function for text search take 2

2008-08-14 Thread Heikki Linnakangas

Jan Urbański wrote:
Not good... Shall I try sorting pg_statistics arrays on text values 
instead of frequencies?


Yeah, I'd go with that. If you only do it for the new 
STATISTIC_KIND_MCV_ELEMENT statistics, you shouldn't need to change any 
other code.


Hmm. There has been discussion on raising default_statistic_target, and 
one reason why we've been afraid to do so has been that it increases the 
cost of planning (there's some O(n^2) algorithms in there). Pre-sorting 
the STATISTIC_KIND_MCV array as well, and replacing the linear searches 
with binary searches would alleviate that, which would be nice.


BTW: I just noticed some text_to_cstring calls, they came from 
elog(DEBUG1)s that I have in my code. But they couldn't have skewn the 
results much, could they?


Well, text_to_cstring was consuming 1.1% of the CPU time on its own, and 
presumably some of the AllocSetAlloc overhead is attributable to that as 
well. And perhaps some of the detoasting as well.


Speaking of which, a lot of time seems to be spent on detoasting. I'd 
like to understand that a better. Where is the detoasting coming from?


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gsoc, oprrest function for text search take 2

2008-08-14 Thread Jan Urbański

Heikki Linnakangas wrote:

Jan Urbański wrote:
Not good... Shall I try sorting pg_statistics arrays on text values 
instead of frequencies?


Yeah, I'd go with that. If you only do it for the new 
STATISTIC_KIND_MCV_ELEMENT statistics, you shouldn't need to change any 
other code.


OK, will do.

BTW: I just noticed some text_to_cstring calls, they came from 
elog(DEBUG1)s that I have in my code. But they couldn't have skewn the 
results much, could they?


Well, text_to_cstring was consuming 1.1% of the CPU time on its own, and 
presumably some of the AllocSetAlloc overhead is attributable to that as 
well. And perhaps some of the detoasting as well.


Speaking of which, a lot of time seems to be spent on detoasting. I'd 
like to understand that a better. Where is the detoasting coming from?


Hmm, maybe bttext_pattern_cmp does some detoasting? It calls 
PG_GETARG_TEXT_PP(), which in turn calls pg_detoast_datum_packed(). Oh, 
and also I think that compare_lexeme_textfreq() uses DatumGetTextP() and 
that also does detoasting. The root of all evil could by keeping a Datum 
in the TextFreq array, and not a text *, which is something you 
pointed out earlier and I apparently didn't understand.


So right now the idea is to:
 (1) pre-sort STATISTIC_KIND_MCELEM values
 (2) build an array of pointers to detoasted values in tssel()
 (3) use binary search when looking for MCELEMs during tsquery analysis

Jan

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gsoc, oprrest function for text search take 2

2008-08-14 Thread Heikki Linnakangas

Jan Urbański wrote:

So right now the idea is to:
 (1) pre-sort STATISTIC_KIND_MCELEM values
 (2) build an array of pointers to detoasted values in tssel()
 (3) use binary search when looking for MCELEMs during tsquery analysis


Sounds like a plan. In (2), it's even better to detoast the values 
lazily. For a typical one-word tsquery, the binary search will only look 
at a small portion of the elements.


Another thing is, how significant is the time spent in tssel() anyway, 
compared to actually running the query? You ran pgbench on EXPLAIN, 
which is good to see where in tssel() the time is spent, but if the time 
spent in tssel() is say 1% of the total execution time, there's no point 
optimizing it further.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gsoc, oprrest function for text search take 2

2008-08-14 Thread Gregory Stark
Jan Urbański [EMAIL PROTECTED] writes:

 Heikki Linnakangas wrote:
 Speaking of which, a lot of time seems to be spent on detoasting. I'd like to
 understand that a better. Where is the detoasting coming from?

 Hmm, maybe bttext_pattern_cmp does some detoasting? It calls
 PG_GETARG_TEXT_PP(), which in turn calls pg_detoast_datum_packed(). Oh, and
 also I think that compare_lexeme_textfreq() uses DatumGetTextP() and that also
 does detoasting. 

DatumGetTextP() will detoast packed data (ie, 1-byte length headers) whereas
DatumGetTextPP will only detoast compressed or externally stored data. I
suspect you're seeing the former.

 The root of all evil could by keeping a Datum in the TextFreq array, and not
 a text *, which is something you pointed out earlier and I apparently
 didn't understand.

Well it doesn't really matter which type. If you store Datums which are
already detoasted then the DatumGetTextP and DatumGetTextPP will just be noops
anyways. If you store packed data (from DatumGetTextPP) then it's probably
safer to store it as Datums so if you need to pass it to any functions which
don't expect packed data they'll untoast it.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com
  Ask me about EnterpriseDB's Slony Replication support!

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gsoc, oprrest function for text search take 2

2008-08-14 Thread Jan Urbański

Heikki Linnakangas wrote:

Jan Urbański wrote:

So right now the idea is to:
 (1) pre-sort STATISTIC_KIND_MCELEM values
 (2) build an array of pointers to detoasted values in tssel()
 (3) use binary search when looking for MCELEMs during tsquery analysis


Sounds like a plan. In (2), it's even better to detoast the values 
lazily. For a typical one-word tsquery, the binary search will only look 
at a small portion of the elements.


Hm, how can I do that? Toast is still a bit black magic to me... Do you 
mean I should stick to having Datums in TextFreq? And use DatumGetTextP 
in bsearch() (assuming I'll get rid of qsort())? I wanted to avoid that, 
so I won't detoast the same value multiple times, but it's true: a 
binary search won't touch most elements.


Another thing is, how significant is the time spent in tssel() anyway, 
compared to actually running the query? You ran pgbench on EXPLAIN, 
which is good to see where in tssel() the time is spent, but if the time 
spent in tssel() is say 1% of the total execution time, there's no point 
optimizing it further.


Changed to the pgbench script to
select * from manual where tsvector @@ to_tsquery('foo');
and the parameters to
pgbench -n -f tssel-bench.sql -t 1000 postgres

and got

number of clients: 1
number of transactions per client: 1000
number of transactions actually processed: 1000/1000
tps = 12.238282 (including connections establishing)
tps = 12.238606 (excluding connections establishing)

samples  %symbol name
174731   31.6200  pglz_decompress
8810515.9438  tsvectorout
17280 3.1271  pg_mblen
13623 2.4653  AllocSetAlloc
13059 2.3632  hash_search_with_hash_value
10845 1.9626  pg_utf_mblen
10335 1.8703  internal_text_pattern_compare
9196  1.6641  index_getnext
9102  1.6471  bttext_pattern_cmp
8075  1.4613  pg_detoast_datum_packed
7437  1.3458  LWLockAcquire
7066  1.2787  hash_any
6811  1.2325  AllocSetFree
6623  1.1985  pg_qsort
6439  1.1652  LWLockRelease
5793  1.0483  DirectFunctionCall2
5322  0.9631  _bt_compare
4664  0.8440  tsCompareString
4636  0.8389  .plt
4539  0.8214  compare_two_textfreqs

But I think I'll go with pre-sorting anyway, it feels cleaner and neater.
--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gsoc, oprrest function for text search take 2

2008-08-14 Thread Jan Urbański

Heikki Linnakangas wrote:

Jan Urbański wrote:

So right now the idea is to:
 (1) pre-sort STATISTIC_KIND_MCELEM values
 (2) build an array of pointers to detoasted values in tssel()
 (3) use binary search when looking for MCELEMs during tsquery analysis


Sounds like a plan. In (2), it's even better to detoast the values 
lazily. For a typical one-word tsquery, the binary search will only look 
at a small portion of the elements.


Here's another version.

Most common lexemes get sorted before storing in pg_statistic. The 
ordering is on length first and value second. That way we can avoid 
strncmp() calls when the lexemes have different lengths (and lexemes 
know their lengths, so the data is readily available). Also, in the 
binary search routine during selectivity estimation we can sometimes 
avoid detoasting (I think) Datums from the pg_statistic MCELEM array. 
See comments in code.


Pre-sorting introduced one problem (see XXX in code): it's not easy 
anymore to get the minimal frequency of MCELEM values. I was using it to 
assert that the selectivity of a tsquery node containing a lexeme not in 
MCELEM is no more that min(MCELEM freqs) / 2. That's only significant 
when the minimum frequency is less than DEFAULT_TS_SEL * 2, so I'm kind 
of inclined to ignore it and maybe drop a comment in the code that this 
may be a potential problem.


If nothing is fundamentally broken with this, I'll repeat my profiling 
tests to see if anything has been gained.


Cheers,
Jan

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin
diff --git a/src/backend/tsearch/Makefile b/src/backend/tsearch/Makefile
index e20a4a2..ba728eb 100644
--- a/src/backend/tsearch/Makefile
+++ b/src/backend/tsearch/Makefile
@@ -19,7 +19,7 @@ DICTFILES=synonym_sample.syn thesaurus_sample.ths 
hunspell_sample.affix \
 OBJS = ts_locale.o ts_parse.o wparser.o wparser_def.o dict.o \
dict_simple.o dict_synonym.o dict_thesaurus.o \
dict_ispell.o regis.o spell.o \
-   to_tsany.o ts_typanalyze.o ts_utils.o
+   to_tsany.o ts_typanalyze.o ts_selfuncs.o ts_utils.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/tsearch/ts_selfuncs.c 
b/src/backend/tsearch/ts_selfuncs.c
new file mode 100644
index 000..0fadc60
--- /dev/null
+++ b/src/backend/tsearch/ts_selfuncs.c
@@ -0,0 +1,320 @@
+/*-
+ *
+ * ts_selfuncs.c
+ *   Selectivity functions for text search types.
+ *
+ * Portions Copyright (c) 1996-2008, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *   $PostgreSQL$
+ *
+ *-
+ */
+#include postgres.h
+
+#include miscadmin.h /* for check_stack_depth() */
+#include utils/memutils.h
+#include utils/builtins.h
+#include utils/syscache.h
+#include utils/lsyscache.h
+#include utils/selfuncs.h
+#include catalog/pg_type.h
+#include catalog/pg_statistic.h
+#include nodes/nodes.h
+#include tsearch/ts_type.h
+
+/* lookup table type for binary searching through MCELEMs */
+typedef struct
+{
+   Datum   element;
+   float4  frequency;
+} TextFreq;
+
+/* type of keys for bsearch()ing through an array of TextFreqs  */
+typedef struct
+{
+   char*lexeme;
+   int length;
+} LexemeKey;
+
+static int
+compare_lexeme_textfreq(const void *e1, const void *e2);
+
+static Selectivity
+tsquery_opr_selec(QueryItem *item, char *operand, TextFreq *lookup,
+ int length, float4 minfreq);
+static Selectivity
+mcelem_tsquery_selec(TSQuery query, Datum *mcelem, int nmcelem,
+  float4 *numbers, int 
nnumbers);
+static double
+tsquerysel(VariableStatData *vardata, Datum constval);
+
+
+/* TSQuery traversal function */
+static Selectivity
+tsquery_opr_selec(QueryItem *item, char *operand, TextFreq *lookup,
+ int length, float4 minfreq)
+{
+   LexemeKey   key;
+   TextFreq*searchres;
+   Selectivity s1, s2;
+
+   /* since this function recurses, it could be driven to stack overflow */
+   check_stack_depth();
+
+   if (item-type == QI_VAL)
+   {
+   QueryOperand *oper = (QueryOperand *) item;
+
+   /*
+* Prepare the key for bsearch().
+*/
+   key.lexeme = operand + oper-distance;
+   key.length = oper-length;
+
+   searchres = (TextFreq *) bsearch(key, lookup, length,
+   
 sizeof(TextFreq), compare_lexeme_textfreq);
+
+   if (searchres)
+   {
+   /*
+* The element is in MCELEM. Return precise selectivity 
(or at
+* least as precise, as ANALYZE could find out).
+*/
+   return (Selectivity) 

Re: [HACKERS] gsoc, oprrest function for text search take 2

2008-08-14 Thread Jan Urbański

Jan Urbański wrote:

Heikki Linnakangas wrote:

Jan Urbański wrote:

So right now the idea is to:
 (1) pre-sort STATISTIC_KIND_MCELEM values
 (2) build an array of pointers to detoasted values in tssel()
 (3) use binary search when looking for MCELEMs during tsquery analysis


Sounds like a plan. In (2), it's even better to detoast the values 
lazily. For a typical one-word tsquery, the binary search will only 
look at a small portion of the elements.


Here's another version.


Context diff this time, always forget to convert them...

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin
*** a/src/backend/tsearch/Makefile
--- b/src/backend/tsearch/Makefile
***
*** 19,25  DICTFILES=synonym_sample.syn thesaurus_sample.ths 
hunspell_sample.affix \
  OBJS = ts_locale.o ts_parse.o wparser.o wparser_def.o dict.o \
dict_simple.o dict_synonym.o dict_thesaurus.o \
dict_ispell.o regis.o spell.o \
!   to_tsany.o ts_typanalyze.o ts_utils.o
  
  include $(top_srcdir)/src/backend/common.mk
  
--- 19,25 
  OBJS = ts_locale.o ts_parse.o wparser.o wparser_def.o dict.o \
dict_simple.o dict_synonym.o dict_thesaurus.o \
dict_ispell.o regis.o spell.o \
!   to_tsany.o ts_typanalyze.o ts_selfuncs.o ts_utils.o
  
  include $(top_srcdir)/src/backend/common.mk
  
*** /dev/null
--- b/src/backend/tsearch/ts_selfuncs.c
***
*** 0 
--- 1,320 
+ /*-
+  *
+  * ts_selfuncs.c
+  *  Selectivity functions for text search types.
+  *
+  * Portions Copyright (c) 1996-2008, PostgreSQL Global Development Group
+  *
+  *
+  * IDENTIFICATION
+  *  $PostgreSQL$
+  *
+  *-
+  */
+ #include postgres.h
+ 
+ #include miscadmin.h /* for check_stack_depth() */
+ #include utils/memutils.h
+ #include utils/builtins.h
+ #include utils/syscache.h
+ #include utils/lsyscache.h
+ #include utils/selfuncs.h
+ #include catalog/pg_type.h
+ #include catalog/pg_statistic.h
+ #include nodes/nodes.h
+ #include tsearch/ts_type.h
+ 
+ /* lookup table type for binary searching through MCELEMs */
+ typedef struct
+ {
+   Datum   element;
+   float4  frequency;
+ } TextFreq;
+ 
+ /* type of keys for bsearch()ing through an array of TextFreqs  */
+ typedef struct
+ {
+   char*lexeme;
+   int length;
+ } LexemeKey;
+ 
+ static int
+ compare_lexeme_textfreq(const void *e1, const void *e2);
+ 
+ static Selectivity
+ tsquery_opr_selec(QueryItem *item, char *operand, TextFreq *lookup,
+ int length, float4 minfreq);
+ static Selectivity
+ mcelem_tsquery_selec(TSQuery query, Datum *mcelem, int nmcelem,
+  float4 *numbers, int 
nnumbers);
+ static double
+ tsquerysel(VariableStatData *vardata, Datum constval);
+ 
+ 
+ /* TSQuery traversal function */
+ static Selectivity
+ tsquery_opr_selec(QueryItem *item, char *operand, TextFreq *lookup,
+ int length, float4 minfreq)
+ {
+   LexemeKey   key;
+   TextFreq*searchres;
+   Selectivity s1, s2;
+ 
+   /* since this function recurses, it could be driven to stack overflow */
+   check_stack_depth();
+ 
+   if (item-type == QI_VAL)
+   {
+   QueryOperand *oper = (QueryOperand *) item;
+ 
+   /*
+* Prepare the key for bsearch().
+*/
+   key.lexeme = operand + oper-distance;
+   key.length = oper-length;
+ 
+   searchres = (TextFreq *) bsearch(key, lookup, length,
+   
 sizeof(TextFreq), compare_lexeme_textfreq);
+ 
+   if (searchres)
+   {
+   /*
+* The element is in MCELEM. Return precise selectivity 
(or at
+* least as precise, as ANALYZE could find out).
+*/
+   return (Selectivity) searchres-frequency;
+   }
+   else
+   {
+   /*
+* The element is not in MCELEM. Punt, but assert that 
the
+* selectivity cannot be more than minfreq / 2.
+*/
+   return (Selectivity) Min(DEFAULT_TS_SEL, minfreq / 2);
+   }
+   }
+ 
+   /* Current TSQuery node is an operator */
+   switch (item-operator.oper)
+   {
+   case OP_NOT:
+   return 1.0 - tsquery_opr_selec(item + 1, operand, 
lookup,
+   
   length, minfreq);
+ 
+   case OP_AND:
+   return
+   tsquery_opr_selec(item + 1, operand, lookup, 
length, 

Re: [HACKERS] gsoc, oprrest function for text search take 2

2008-08-14 Thread Alvaro Herrera
Jan Urbański wrote:
 Heikki Linnakangas wrote:

 Sounds like a plan. In (2), it's even better to detoast the values  
 lazily. For a typical one-word tsquery, the binary search will only 
 look at a small portion of the elements.

 Hm, how can I do that? Toast is still a bit black magic to me... Do you  
 mean I should stick to having Datums in TextFreq?

Store both the Datum and the text *.  If the latter is NULL, then grab
the datum, detoast and store the result in the text *.  Next time you
need to look at it, it's already detoasted.

I don't know the code so I have no idea if this is applicable.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gsoc, oprrest function for text search take 2

2008-08-14 Thread Jan Urbański

Alvaro Herrera wrote:

Jan Urbański wrote:

Heikki Linnakangas wrote:


Sounds like a plan. In (2), it's even better to detoast the values  
lazily. For a typical one-word tsquery, the binary search will only 
look at a small portion of the elements.
Hm, how can I do that? Toast is still a bit black magic to me... Do you  
mean I should stick to having Datums in TextFreq?


Store both the Datum and the text *.  If the latter is NULL, then grab
the datum, detoast and store the result in the text *.  Next time you
need to look at it, it's already detoasted.


Yeah, I got that idea, but then I thought the chances of touching the 
same element during binary search twice were very small. Especially now 
when the detoasting occurs only when we hit a text Datum that has the 
same length as the sought lexeme.

Still, I can do it if people feel like it.

Cheers,
Jan

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gsoc, oprrest function for text search take 2

2008-08-14 Thread Alvaro Herrera
Jan Urbański wrote:

 Yeah, I got that idea, but then I thought the chances of touching the  
 same element during binary search twice were very small. Especially now  
 when the detoasting occurs only when we hit a text Datum that has the  
 same length as the sought lexeme.
 Still, I can do it if people feel like it.

Actually, in that light it sounds pretty useless.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gsoc, oprrest function for text search take 2

2008-08-13 Thread Jan Urbański

Heikki Linnakangas wrote:

Jan Urbański wrote:
through it. The only tiny ugliness is that there's one function used 
for qsort() and another for bsearch(), because I'm sorting an array of 
texts (from pg_statistic) and I'm binary searching for a lexeme 
(non-NULL terminated string with length).


It would be nice to clean that up a bit. I think you could convert the 
lexeme to a TextFreq, or make the TextFreq.element a text * instead of 
Datum (ie., detoast it with PG_DETOAST_DATUM while you build the array 
for qsort).


Hm, making it a text * won't help, I think, because the problem is:
 - pg_statistics holds an array of text datums and an array of float 
datums, ordered by the latter
 - a tsquery is a tree of WordEntries (lexemes), ie. non-NULL 
terminated strings


The approach was to:
 (1) create an array of (text, float) pairs by zipping the two 
pg_statistics arrays into one

 (2) sort that array on text values
 (3) every time a frequency of a WordEntry needs to be determined, look 
it up using binary search in the sorted array


So for (2) I needed a function that compares text with text (I reused 
bttext_pattern_cmp for that). And to do (3) I needed a function that 
compares text with WordEntries. I didn't want to convert WordEntries 
into text * values, because that would require a palloc().
Hmm, maybe I could build an array of (lexeme, float) in (1) instead, 
turning text * into a lexeme is very straightforward. Then I'd use the 
same function in (2) and (3) - cleaner.


However, maybe I should just skip all that qsort() nonsense and use a 
hashtable?
Another bold thought is: keep the pg_statistics arrays for tsvectors 
ordered by text datums, and not frequencies. Would've been awkward, 
'cause people expect the most common frequencies array to be sorted and 
not the most common values/elements one, but it'd help a lot and 
simplify the code quite a bit. It would induce one extra qsort() in 
ts_typanalyze(), but would allow only bsearch()es in tssel().



My medicore gprof skills got me:
[... nonsense ...]


I'd like to see a little bit more testing of that. I can't read gprof 
myself, so the above doesn't give me much confidence. I use oprofile, 
which I find is much simpler to use.
I think the worst case scenario is with statistics_target set to 
maximum, with a simplest possible query and simplest possible tsquery.


One kernel recompile later...
I got oprofile running, here's the setup:
$ pgbench -n -f tssel-bench.sql -c 10 -t 1000 postgres
And here's tssel-bench.sql:
explain select * from manuals where tsvector @@ to_tsquery('foo');

The manuals table was rather small (but that's irrelevant I think) and 
statistic_target for the tsvector column were set to 100.


Obviously foo() isn't a most common element in my test data, so the 
bsearch()es always miss. The results are:

samples  %symbol name
101507   13.4461  internal_text_pattern_compare
9139812.1070  bttext_pattern_cmp
8275310.9618  pg_detoast_datum_packed
66434 8.8001  pg_qsort
54005 7.1537  DirectFunctionCall2
48925 6.4808  pglz_decompress
44931 5.9518  compare_two_textfreqs
40178 5.3222  AllocSetAlloc
26763 3.5451  AllocSetCheck
20839 2.7604  AtEOXact_CatCache
16057 2.1270  AllocSetFree
13772 1.8243  swapfunc
10001 1.3248  .plt
7859  1.0410  text_to_cstring
7556  1.0009  datumCopy

So, maybe qsorting() every time you plan a query is not that cheap after 
all. I think hashing would also be an overkill. How do peole feel about 
storing the statistics sorted on values and not on frequencies?


Cheers,
Jan

PS:
I used a simple
$ opreport --symbols /path/to/postgres
are there any magic switches I need to add?

J

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gsoc, oprrest function for text search take 2

2008-08-13 Thread Heikki Linnakangas

Jan Urbański wrote:

26763 3.5451  AllocSetCheck


Make sure you disable assertions before profiling. Although I'm actually
a bit surprised the overhead isn't more than 3.5%, I've seen much higher
overheads on other tests, but it's still skewing the results.

- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gsoc, oprrest function for text search take 2

2008-08-11 Thread Heikki Linnakangas

Jan Urbański wrote:

Heikki Linnakangas wrote:

Jan Urbański wrote:
Another thing are cstring_to_text_with_len calls. I'm doing them so I 
can use bttextcmp in bsearch(). I think I could come up with a 
dedicated function to return text Datums and WordEntries (read: 
non-NULL terminated strings with a given length).


Just keep them as cstrings and use strcmp. We're only keeping the 
array sorted so that we can binary search it, so we don't need proper 
locale-dependent collation. Note that we already assume that two 
strings ('text's) are equal if and only if their binary 
representations are equal (texteq() uses strcmp).


OK, I got rid of cstring-text calls and memory contexts as I went 
through it. The only tiny ugliness is that there's one function used for 
qsort() and another for bsearch(), because I'm sorting an array of texts 
(from pg_statistic) and I'm binary searching for a lexeme (non-NULL 
terminated string with length).


It would be nice to clean that up a bit. I think you could convert the 
lexeme to a TextFreq, or make the TextFreq.element a text * instead of 
Datum (ie., detoast it with PG_DETOAST_DATUM while you build the array 
for qsort).



My medicore gprof skills got me:
0.000.22   5/5   OidFunctionCall4 [37]
[38]28.40.000.22   5 tssel [38]
0.000.17   5/5 get_restriction_variable [40]
0.030.01   5/10  pg_qsort [60]
0.000.00   5/5   get_attstatsslot [139]

Hopefully that says that the qsort() overhead is small compared to 
munging through the planner Node.


I'd like to see a little bit more testing of that. I can't read gprof 
myself, so the above doesn't give me much confidence. I use oprofile, 
which I find is much simpler to use.


I think the worst case scenario is with statistics_target set to 
maximum, with a simplest possible query and simplest possible tsquery.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gsoc, oprrest function for text search take 2

2008-07-29 Thread Heikki Linnakangas

Jan Urbański wrote:
Another thing are cstring_to_text_with_len calls. I'm doing them so I 
can use bttextcmp in bsearch(). I think I could come up with a dedicated 
function to return text Datums and WordEntries (read: non-NULL 
terminated strings with a given length).


Just keep them as cstrings and use strcmp. We're only keeping the array 
sorted so that we can binary search it, so we don't need proper 
locale-dependent collation. Note that we already assume that two strings 
('text's) are equal if and only if their binary representations are 
equal (texteq() uses strcmp).


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gsoc, oprrest function for text search take 2

2008-07-29 Thread Jan Urbański

Heikki Linnakangas wrote:

Jan Urbański wrote:
Another thing are cstring_to_text_with_len calls. I'm doing them so I 
can use bttextcmp in bsearch(). I think I could come up with a 
dedicated function to return text Datums and WordEntries (read: 
non-NULL terminated strings with a given length).


Just keep them as cstrings and use strcmp. We're only keeping the array 
sorted so that we can binary search it, so we don't need proper 
locale-dependent collation. Note that we already assume that two strings 
('text's) are equal if and only if their binary representations are 
equal (texteq() uses strcmp).


OK, I got rid of cstring-text calls and memory contexts as I went 
through it. The only tiny ugliness is that there's one function used for 
qsort() and another for bsearch(), because I'm sorting an array of texts 
(from pg_statistic) and I'm binary searching for a lexeme (non-NULL 
terminated string with length).

My medicore gprof skills got me:
0.000.22   5/5   OidFunctionCall4 [37]
[38]28.40.000.22   5 tssel [38]
0.000.17   5/5 
get_restriction_variable [40]

0.030.01   5/10  pg_qsort [60]
0.000.00   5/5   get_attstatsslot [139]

Hopefully that says that the qsort() overhead is small compared to 
munging through the planner Node.

Revised version of the patch attached.

Cheers,
Jan

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin
diff --git a/src/backend/tsearch/Makefile b/src/backend/tsearch/Makefile
index e20a4a2..ba728eb 100644
*** a/src/backend/tsearch/Makefile
--- b/src/backend/tsearch/Makefile
***
*** 19,25 
  OBJS = ts_locale.o ts_parse.o wparser.o wparser_def.o dict.o \
dict_simple.o dict_synonym.o dict_thesaurus.o \
dict_ispell.o regis.o spell.o \
!   to_tsany.o ts_typanalyze.o ts_utils.o
  
  include $(top_srcdir)/src/backend/common.mk
  
--- 19,25 
  OBJS = ts_locale.o ts_parse.o wparser.o wparser_def.o dict.o \
dict_simple.o dict_synonym.o dict_thesaurus.o \
dict_ispell.o regis.o spell.o \
!   to_tsany.o ts_typanalyze.o ts_selfuncs.o ts_utils.o
  
  include $(top_srcdir)/src/backend/common.mk
  
diff --git a/src/backend/tsearch/ts_selfuncs.c 
b/src/backend/tsearch/ts_selfuncs.c
new file mode 100644
index 000..4deb507
*** /dev/null
--- b/src/backend/tsearch/ts_selfuncs.c
***
*** 0,-1 
--- 1,334 
+ /*-
+  *
+  * ts_selfuncs.c
+  *  Selectivity functions for text search types.
+  *
+  * Portions Copyright (c) 1996-2008, PostgreSQL Global Development Group
+  *
+  *
+  * IDENTIFICATION
+  *  $PostgreSQL$
+  *
+  *-
+  */
+ #include postgres.h
+ 
+ #include miscadmin.h /* for check_stack_depth() */
+ #include utils/memutils.h
+ #include utils/builtins.h
+ #include utils/syscache.h
+ #include utils/lsyscache.h
+ #include utils/selfuncs.h
+ #include catalog/pg_type.h
+ #include catalog/pg_statistic.h
+ #include nodes/nodes.h
+ #include tsearch/ts_type.h
+ 
+ /* lookup table type for binary searching through MCELEMs */
+ typedef struct
+ {
+   Datum   element;
+   float4  frequency;
+ } TextFreq;
+ 
+ /* type of keys for bsearch()ing through an array of TextFreqs  */
+ typedef struct
+ {
+   char*lexeme;
+   int length;
+ } LexemeKey;
+ 
+ static int
+ compare_two_textfreqs(const void *e1, const void *e2);
+ static int
+ compare_lexeme_textfreq(const void *e1, const void *e2);
+ 
+ static Selectivity
+ tsquery_opr_selec(QueryItem *item, char *operand, TextFreq *lookup,
+ int length, float4 minfreq);
+ static Selectivity
+ mcelem_tsquery_selec(TSQuery query, Datum *mcelem, int nmcelem,
+  float4 *numbers, int 
nnumbers);
+ static double
+ tsquerysel(VariableStatData *vardata, Datum constval);
+ 
+ 
+ /* TSQuery traversal function */
+ static Selectivity
+ tsquery_opr_selec(QueryItem *item, char *operand, TextFreq *lookup,
+ int length, float4 minfreq)
+ {
+   LexemeKey   key;
+   TextFreq*searchres;
+   Selectivity s1, s2;
+ 
+   /* since this function recurses, it could be driven to stack overflow */
+   check_stack_depth();
+ 
+   if (item-type == QI_VAL)
+   {
+   QueryOperand *oper = (QueryOperand *) item;
+ 
+   /*
+* Prepare the key for bsearch().
+*/
+   key.lexeme = operand + oper-distance;
+   key.length = oper-length;
+ 
+   searchres = (TextFreq *) bsearch(key, lookup, length,
+   
 sizeof(TextFreq), compare_lexeme_textfreq);
+   if (searchres)
+  

[HACKERS] gsoc, oprrest function for text search take 2

2008-07-28 Thread Jan Urbański

Hi,

I know Commit Fest is in progress, as well as the holiday season. But 
the Summer of Code ends in about three weeks, so I'd like to request a 
bit of out-of-order processing :)


My previous mail sent to -hackers is here:
http://archives.postgresql.org/message-id/[EMAIL PROTECTED]

I had problems applying the patch from that mail (it got mangled 
somehow, could by my mail agent...), so I'm attaching it again.


There are two things that I'm not particularly proud of in the patch. 
First is palloc()ing and filling in a table simply to user qsort() on 
it. I guess I could either hack up a version of get_attstatsslot() that 
returns an array of (element, frequency) pairs or sort the elements by 
hand instead of using qsort() and keep the order of frequencies in sync 
while doing the sort.


Another thing are cstring_to_text_with_len calls. I'm doing them so I 
can use bttextcmp in bsearch(). I think I could come up with a dedicated 
function to return text Datums and WordEntries (read: non-NULL 
terminated strings with a given length).


Are these unsignificant? Or should I do these optimizations? Or, sadly, 
signs that using binary search is not a good decision?


Thanks,
Jan

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin

diff --git a/src/backend/tsearch/Makefile b/src/backend/tsearch/Makefile
index e20a4a2..ba728eb 100644
--- a/src/backend/tsearch/Makefile
+++ b/src/backend/tsearch/Makefile
@@ -19,7 +19,7 @@ DICTFILES=synonym_sample.syn thesaurus_sample.ths 
hunspell_sample.affix \
 OBJS = ts_locale.o ts_parse.o wparser.o wparser_def.o dict.o \
dict_simple.o dict_synonym.o dict_thesaurus.o \
dict_ispell.o regis.o spell.o \
-   to_tsany.o ts_typanalyze.o ts_utils.o
+   to_tsany.o ts_typanalyze.o ts_selfuncs.o ts_utils.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/tsearch/ts_selfuncs.c 
b/src/backend/tsearch/ts_selfuncs.c
new file mode 100644
index 000..4b86072
--- /dev/null
+++ b/src/backend/tsearch/ts_selfuncs.c
@@ -0,0 +1,299 @@
+/*-
+ *
+ * ts_selfuncs.c
+ *   Selectivity functions for text search types.
+ *
+ * Portions Copyright (c) 1996-2008, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *   $PostgreSQL$
+ *
+ *-
+ */
+#include postgres.h
+
+#include miscadmin.h /* for check_stack_depth() */
+#include utils/memutils.h
+#include utils/builtins.h
+#include utils/syscache.h
+#include utils/lsyscache.h
+#include utils/selfuncs.h
+#include catalog/pg_type.h
+#include catalog/pg_statistic.h
+#include nodes/nodes.h
+#include tsearch/ts_type.h
+
+/* lookup table type for binary searching through MCELEMs */
+typedef struct
+{
+   Datum   element;
+   float4  frequency;
+} TextFreq;
+
+static int
+compare_textfreq(const void *e1, const void *e2);
+static Selectivity
+tsquery_opr_selec(QueryItem *item, char *operand, TextFreq *lookup,
+ int length, float4 minfreq);
+static Selectivity
+mcelem_tsquery_selec(TSQuery query, Datum *mcelem, int nmcelem,
+  float4 *numbers, int 
nnumbers);
+static double
+tsquerysel(VariableStatData *vardata, Datum constval);
+
+
+/* TSQuery traversal function */
+static Selectivity
+tsquery_opr_selec(QueryItem *item, char *operand, TextFreq *lookup,
+ int length, float4 minfreq)
+{
+   TextFreqkey,
+   *searchres;
+   Selectivity s1, s2;
+
+   /* since this function recurses, it could be driven to stack overflow */
+   check_stack_depth();
+
+   if (item-type == QI_VAL)
+   {
+   QueryOperand *oper = (QueryOperand *) item;
+
+   /*
+* Prepare the key for bsearch(). No need to initialize 
key.frequency,
+* because sorting is only on key.element.
+*/
+   key.element = PointerGetDatum(
+   cstring_to_text_with_len(operand + oper-distance, 
oper-length));
+
+   searchres = (TextFreq *) bsearch(key, lookup, length,
+   
 sizeof(TextFreq), compare_textfreq);
+   if (searchres)
+   {
+   /*
+* The element is in MCELEM. Return precise selectivity 
(or at
+* least as precise, as ANALYZE could find out).
+*/
+   return (Selectivity) searchres-frequency;
+   }
+   else
+   {
+   /*
+* The element is not in MCELEM. Punt, but  assure that 
the
+* selectivity cannot be more than minfreq / 2.
+*/
+   return