[HACKERS] text search patch status update?
Any status updates on the following patches?

1. Fragments in tsearch2 headlines: http://archives.postgresql.org/pgsql-hackers/2008-08/msg00043.php
2. Bug in hlCover: http://archives.postgresql.org/pgsql-hackers/2008-08/msg00089.php

-Sushant. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] text search patch status update?
Patch #1. Teodor was fine with the previous version of the patch. After that I modified it slightly to allow a FragmentDelimiter option and Teodor may have to look at that. Patch #2. I think this is a straightforward bug fix. -Sushant. On Tue, Sep 16, 2008 at 11:27 AM, Alvaro Herrera [EMAIL PROTECTED] wrote: Sushant Sinha wrote: Any status updates on the following patches? 1. Fragments in tsearch2 headlines: http://archives.postgresql.org/pgsql-hackers/2008-08/msg00043.php 2. Bug in hlCover: http://archives.postgresql.org/pgsql-hackers/2008-08/msg00089.php Are these ready for review? If so, please add them to this commitfest, http://wiki.postgresql.org/wiki/CommitFest:2008-09 -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] Very bad FTS performance with the Polish config
ts_headline calls the ts_lexize equivalent to break the text. Of course there is an algorithm to process the tokens and generate the headline. I would be really surprised if the algorithm to generate the headline is somehow dependent on language (as it only processes the tokens). So Oleg is right when he says ts_lexize is something to be checked. I will try to replicate what you are trying to do, but in the meantime can you run the same ts_headline under psql multiple times and paste the results? -Sushant. 2009/11/19 Wojciech Knapik webmas...@wolniartysci.pl Oleg Bartunov wrote: Yes, for 4-word texts the results are similar. Try that with a longer text and the difference becomes more and more significant. For the lorem ipsum text, 'polish' is about 4 times slower than 'english'. For 5 repetitions of the text, it's 6 times, for 10 repetitions - 7.5 times... Again, I see nothing unclear here, since dictionaries (as specified in configuration) apply to ALL words in the document. The more words in the document, the more overhead. You're missing the point. I'm not surprised that the function takes more time for larger input texts - that's obvious. The thing is, the computation times rise more steeply when the Polish config is used. Steeply enough that the difference between the Polish and English configs becomes enormous in practical cases. Now this may be expected behaviour, but since I don't know if it is, I posted to the mailing lists to find out. If you're saying this is ok and there's nothing to fix here, then there's nothing more to discuss and we may consider the thread closed. If not, ts_headline deserves a closer look. cheers, Wojciech Knapik
[HACKERS] lexeme ordering in tsvector
It seems like the ordering of lexemes in tsvector has changed from 8.3 to 8.4. For example in 8.3.1,

postgres=# select to_tsvector('english', 'quit everytime');
      to_tsvector
-----------------------
 'quit':1 'everytim':2

The lexemes are arranged by length and then by string comparison. In postgres 8.4.1,

select to_tsvector('english', 'quit everytime');
      to_tsvector
-----------------------
 'everytim':2 'quit':1

they are arranged by strncmp and then by length. I looked in tsvector_op.c: in the function tsCompareString, first memcmp and then length comparison is done. Was this change in ordering deliberate? Wouldn't length comparison be cheaper than memcmp? -Sushant.
Re: [HACKERS] [GENERAL] Fragments in tsearch2 headline
Now I understand the code much better. A few more questions on headline generation that I was not able to get from the code: 1. Why is hlparsetext used to parse the document rather than the parsetext function? Since words to be included in the headline will be marked afterwards, it seems more reasonable to just use the parsetext function. The main difference I see is the use of hlfinditem and marking whether some word is repeated. The reason this is important is that hlparsetext does not seem to be storing word positions, which parsetext does. The word positions are important for generating a headline with fragments. 2. I would prefer the signature ts_headline( [regconfig,] text, tsquery [,text] ) and the function should accept 'NumFragments=N' for the default parser. Other parsers may use other options. Does this mean we want a unified function ts_headline and we trigger the fragments if NumFragments is specified? It seems that introducing a new function which can take a configuration OID or name is complex, as there are so many functions handling these issues in wparser.c. If this is true then we need to just add marking of headline words in prsd_headline. Otherwise we will need another prsd_headline_with_covers function. 3. In many cases people may already have a TSVector for a given document (for search operation). Would it be faster to pass the TSVector to the headline function when compared to computing the TSVector each time? If that is the case then should we have an option to pass a TSVector to the headline function? -Sushant. On Sat, 2008-05-24 at 07:57 +0400, Teodor Sigaev wrote: [moved to -hackers, because talk is about implementation details] I've ported the patch of Sushant Sinha for fragmented headlines to pg8.3.1 (http://archives.postgresql.org/pgsql-general/2007-11/msg00508.php) Thank you. 1. diff -Nrub postgresql-8.3.1-orig/contrib/tsearch2/tsearch2.c - now contrib/tsearch2 is a compatibility layer for old applications - they don't know about new features.
So, this part isn't needed. 2. The solution to compile the function (ts_headline_with_fragments) into core, but use it only from the contrib module, looks very odd. So, the new feature can be used only with the compatibility layer for the old release :) 3. headline_with_fragments() is hardcoded to use the default parser, but what will happen when a configuration uses another parser? For example, for the Japanese language. 4. I would prefer the signature ts_headline( [regconfig,] text, tsquery [,text] ) and the function should accept 'NumFragments=N' for the default parser. Other parsers may use other options. 5. It just doesn't work correctly, because the new code doesn't take care of parser-specific types of lexemes.

contrib_regression=# select headline_with_fragments('english', 'wow asd-wow wow', 'asd', '');
     headline_with_fragments
----------------------------------
 ...wow asd-wow<b>asd</b>-wow wow
(1 row)

So, I incline to use the existing framework/infrastructure, although it may be a subject to change. Some description: 1. ts_headline defines a correct parser to use 2. it calls hlparsetext to split text into a structure suitable for both goals: find the best fragment(s) and concatenate those fragment(s) back to the text representation 3. it calls the parser-specific method prsheadline, which works with preparsed text (parsing was done in hlparsetext). The method should mark the needed words/parts/lexemes etc. 4. ts_headline glues fragments into text and returns that. We need a parser's headline method because only the parser knows all about its lexemes. -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/
Re: [HACKERS] [GENERAL] Fragments in tsearch2 headline
I have attached a new patch with respect to the current cvs head. This produces a headline in a document for a given query. Basically it identifies fragments of text that contain the query and displays them. DESCRIPTION HeadlineParsedText contains an array of actual words but not information about the norms. We need an indexed position vector for each norm so that we can quickly evaluate a number of possible fragments. Something that tsvector provides. So this patch changes HeadlineParsedText to contain the norms (ParsedText). This field is updated while parsing in hlparsetext. The position information of the norms corresponds to the position of words in HeadlineParsedText (not to the norms positions as is the case in tsvector). This works correctly with the current parser. If you think there may be issues with other parsers please let me know. This approach does not change any other interface and fits nicely with the overall framework. The norms are converted into a tsvector and a number of covers are generated. The best covers are then chosen to be in the headline. The covers are separated using a hardcoded coversep. Let me know if you want to expose this as an option. Covers that overlap with already chosen covers are excluded. Some options like ShortWord and MinWords are not taken care of right now. MaxWords is used as maxcoversize. Let me know if you would like to see other options for fragment generation as well. Let me know any more changes you would like to see. -Sushant. On Tue, 2008-05-27 at 13:30 +0400, Teodor Sigaev wrote: Hi! 1. Why is hlparsetext used to parse the document rather than the parsetext function? Since words to be included in the headline will be marked afterwards, it seems more reasonable to just use the parsetext function. The main difference I see is the use of hlfinditem and marking whether some word is repeated. hlparsetext preserves any kind of lexeme - not indexed, spaces etc. parsetext doesn't.
hlparsetext preserves the original form of lexemes. parsetext doesn't. The reason this is important is that hlparsetext does not seem to be storing word positions which parsetext does. The word positions are important for generating headline with fragments. That isn't needed - hlparsetext preserves the whole text, so the position is just the index in the array. 2. I would prefer the signature ts_headline( [regconfig,] text, tsquery [,text] ) and the function should accept 'NumFragments=N' for the default parser. Other parsers may use other options. Does this mean we want a unified function ts_headline and we trigger the fragments if NumFragments is specified? The trigger should be inside the parser-specific function (pg_ts_parser.prsheadline). Other parsers might not recognize that option. It seems that introducing a new function which can take configuration OID, or name is complex as there are so many functions handling these issues in wparser.c. No, of course - ts_headline takes care of finding the configuration and calling the correct parser. If this is true then we need to just add marking of headline words in prsd_headline. Otherwise we will need another prsd_headline_with_covers function. Yeah, pg_ts_parser.prsheadline should mark the lexemes too. It even can change the array of HeadlineParsedText. 3. In many cases people may already have TSVector for a given document (for search operation). Would it be faster to pass TSVector to headline function when compared to computing TSVector each time? If that is the case then should we have an option to pass TSVector to headline function? As I mentioned above, tsvector doesn't contain the whole information about the text.
Index: src/backend/tsearch/dict.c
===
RCS file: /home/sushant/devel/pgsql-cvs/pgsql/src/backend/tsearch/dict.c,v
retrieving revision 1.5
diff -u -r1.5 dict.c
--- src/backend/tsearch/dict.c	25 Mar 2008 22:42:43 -0000	1.5
+++ src/backend/tsearch/dict.c	30 May 2008 23:20:57 -0000
@@ -16,6 +16,7 @@
 #include "catalog/pg_type.h"
 #include "tsearch/ts_cache.h"
 #include "tsearch/ts_utils.h"
+#include "tsearch/ts_public.h"
 #include "utils/builtins.h"
Index: src/backend/tsearch/to_tsany.c
===
RCS file: /home/sushant/devel/pgsql-cvs/pgsql/src/backend/tsearch/to_tsany.c,v
retrieving revision 1.12
diff -u -r1.12 to_tsany.c
--- src/backend/tsearch/to_tsany.c	16 May 2008 16:31:01 -0000	1.12
+++ src/backend/tsearch/to_tsany.c	31 May 2008 08:43:27 -0000
@@ -15,6 +15,7 @@
 #include "catalog/namespace.h"
 #include "tsearch/ts_cache.h"
+#include "tsearch/ts_public.h"
 #include "tsearch/ts_utils.h"
 #include "utils/builtins.h"
 #include "utils/syscache.h"
Index: src/backend/tsearch/ts_parse.c
===
RCS file: /home/sushant/devel/pgsql-cvs/pgsql/src/backend/tsearch/ts_parse.c,v
retrieving
[HACKERS] phrase search
I have attached a patch for phrase search with respect to the cvs head. Basically it takes a phrase (text) and a TSVector. It checks if the relative positions of lexemes in the phrase are the same as their positions in the TSVector. If the configuration for text search is simple, then this will produce exact phrase search. Otherwise the stopwords in a phrase will be ignored and the words in a phrase will only be matched with the stemmed lexemes. For my application I am using this as a separate shared object. I do not know how to expose this function from the core. Can someone explain how to do this? I saw this discussion on phrase search and I am not sure what other functionality is required. http://archives.postgresql.org/pgsql-general/2008-02/msg01170.php -Sushant.

Index: src/backend/utils/adt/Makefile
===
RCS file: /home/sushant/devel/pgsql-cvs/pgsql/src/backend/utils/adt/Makefile,v
retrieving revision 1.69
diff -u -r1.69 Makefile
--- src/backend/utils/adt/Makefile	19 Feb 2008 10:30:08 -0000	1.69
+++ src/backend/utils/adt/Makefile	31 May 2008 19:57:34 -0000
@@ -29,7 +29,7 @@
 	tsginidx.o tsgistidx.o tsquery.o tsquery_cleanup.o tsquery_gist.o \
 	tsquery_op.o tsquery_rewrite.o tsquery_util.o tsrank.o \
 	tsvector.o tsvector_op.o tsvector_parser.o \
-	txid.o uuid.o xml.o
+	txid.o uuid.o xml.o phrase_search.o

 like.o: like.c like_match.c
Index: src/backend/utils/adt/phrase_search.c
===
RCS file: src/backend/utils/adt/phrase_search.c
diff -N src/backend/utils/adt/phrase_search.c
--- /dev/null	1 Jan 1970 00:00:00 -0000
+++ src/backend/utils/adt/phrase_search.c	31 May 2008 19:56:59 -0000
@@ -0,0 +1,167 @@
+#include "postgres.h"
+
+#include "tsearch/ts_type.h"
+#include "tsearch/ts_utils.h"
+
+#include "fmgr.h"
+
+#ifdef PG_MODULE_MAGIC
+PG_MODULE_MAGIC;
+#endif
+
+PG_FUNCTION_INFO_V1(is_phrase_present);
+Datum is_phrase_present(PG_FUNCTION_ARGS);
+
+typedef struct {
+	WordEntryPosVector *posVector;
+	int4		posInPhrase;
+	int4		curpos;
+} PhraseInfo;
+
+static int
+WordCompareVectorEntry(char *eval, WordEntry *ptr, ParsedWord *prsdword)
+{
+	if (ptr->len == prsdword->len)
+		return strncmp(eval + ptr->pos,
+					   prsdword->word,
+					   prsdword->len);
+
+	return (ptr->len > prsdword->len) ? 1 : -1;
+}
+
+/*
+ * Returns a pointer to a WordEntry from tsvector t corresponding to prsdword.
+ * Returns NULL if not found.
+ */
+static WordEntry *
+find_wordentry_prsdword(TSVector t, ParsedWord *prsdword)
+{
+	WordEntry  *StopLow = ARRPTR(t);
+	WordEntry  *StopHigh = (WordEntry *) STRPTR(t);
+	WordEntry  *StopMiddle;
+	int			difference;
+
+	/* Loop invariant: StopLow <= item < StopHigh */
+
+	while (StopLow < StopHigh)
+	{
+		StopMiddle = StopLow + (StopHigh - StopLow) / 2;
+		difference = WordCompareVectorEntry(STRPTR(t), StopMiddle, prsdword);
+		if (difference == 0)
+			return StopMiddle;
+		else if (difference < 0)
+			StopLow = StopMiddle + 1;
+		else
+			StopHigh = StopMiddle;
+	}
+
+	return NULL;
+}
+
+static int4
+check_and_advance(int4 i, PhraseInfo *phraseInfo)
+{
+	WordEntryPosVector *posvector1, *posvector2;
+	int4		diff;
+
+	posvector1 = phraseInfo[i].posVector;
+	posvector2 = phraseInfo[i + 1].posVector;
+
+	diff = phraseInfo[i + 1].posInPhrase - phraseInfo[i].posInPhrase;
+	while (posvector2->pos[phraseInfo[i + 1].curpos] - posvector1->pos[phraseInfo[i].curpos] < diff)
+	{
+		if (phraseInfo[i + 1].curpos >= posvector2->npos - 1)
+			return 2;
+		else
+			phraseInfo[i + 1].curpos += 1;
+	}
+
+	if (posvector2->pos[phraseInfo[i + 1].curpos] - posvector1->pos[phraseInfo[i].curpos] == diff)
+		return 1;
+	else
+		return 0;
+}
+
+int4
+initialize_phraseinfo(ParsedText *prs, TSVector t, PhraseInfo *phraseInfo)
+{
+	WordEntry  *entry;
+	int4		i;
+
+	for (i = 0; i < prs->curwords; i++)
+	{
+		phraseInfo[i].posInPhrase = prs->words[i].pos.pos;
+		entry = find_wordentry_prsdword(t, &(prs->words[i]));
+		if (entry == NULL)
+			return 0;
+		else
+			phraseInfo[i].posVector = _POSVECPTR(t, entry);
+	}
+	return 1;
+}
+
+Datum
+is_phrase_present(PG_FUNCTION_ARGS)
+{
+	ParsedText	prs;
+	int4		numwords, i, retval, found = 0;
+	PhraseInfo *phraseInfo;
+	text	   *phrase = PG_GETARG_TEXT_P(0);
+	TSVector	t = PG_GETARG_TSVECTOR(1);
+	Oid			cfgId = getTSCurrentConfig(true);
+
+	prs.lenwords = (VARSIZE(phrase) - VARHDRSZ) / 6;	/* just estimation of word's number */
+	if (prs.lenwords == 0)
+		prs.lenwords = 2;
+	prs.curwords = 0;
+	prs.pos = 0;
+	prs.words = (ParsedWord *) palloc0(sizeof(ParsedWord) * prs.lenwords);
+
+	parsetext(cfgId, &prs, VARDATA(phrase), VARSIZE(phrase) - VARHDRSZ);
+
+	/* allocate & initialize */
+	numwords = prs.curwords;
+	phraseInfo = palloc0(numwords * sizeof(PhraseInfo));
+
+	if (numwords > 0 && initialize_phraseinfo(&prs, t,
Re: [HACKERS] phrase search
On Mon, 2008-06-02 at 19:39 +0400, Teodor Sigaev wrote: I have attached a patch for phrase search with respect to the cvs head. Basically it takes a phrase (text) and a TSVector. It checks if the relative positions of lexemes in the phrase are the same as their positions in the TSVector. Ideally, phrase search should be implemented as a new operator in tsquery, say # with optional distance. So, tsquery 'foo #2 bar' means: find all texts where 'bar' is placed no farther than two words from 'foo'. The complexity is about complex boolean expressions ( 'foo #1 ( bar1 & bar2 )' ) and about several languages such as Norwegian or German. German has combining words, like footboolbar - and they have several variants of splitting, so the result of to_tsquery('foo # footboolbar') will be 'foo # ( ( football & bar ) | ( foot & ball & bar ) )' where variants are connected with the OR operation. This is far more complicated than I thought. Of course, phrase search should be able to use indexes. I can probably look into how to use an index. Any pointers on this? If the configuration for text search is simple, then this will produce exact phrase search. Otherwise the stopwords in a phrase will be ignored and the words in a phrase will only be matched with the stemmed lexemes. Your solution can't be used as is, because the user should use tsquery too to use an index: column @@ to_tsquery('phrase search') AND is_phrase_present('phrase search', column) The first clause will be used for an index scan and it will quickly find candidates. Yes, this is exactly how I am using it in my application. Do you think this will solve a lot of common cases, or should we try to get phrase search to: 1. Use an index 2. Support arbitrary distance between lexemes 3. Support complex boolean queries -Sushant. For my application I am using this as a separate shared object. I do not know how to expose this function from the core. Can someone explain how to do this?
Look at the contrib/ directory in pgsql's source code - make a contrib module from your patch. As an example, look at the adminpack module - it's rather simple. Comments on your code: 1) +#ifdef PG_MODULE_MAGIC +PG_MODULE_MAGIC; +#endif That isn't needed for compiled-in core files, it's only needed for modules. 2) use only /**/ comments, do not use // (C++ style) comments
Re: [HACKERS] [GENERAL] Fragments in tsearch2 headline
Efficiency: I realized that we do not need to store all norms. We need to only store norms that are in the query. So I moved the addition of norms from addHLParsedLex to hlfinditem. This should add very little memory overhead to existing headline generation. If this is still not acceptable for default headline generation, then I can push it into mark_hl_fragments. But I think any headline marking function will benefit by having the norms corresponding to the query. Why do we need norms? hlCover does the exact thing that Cover in tsrank does, which is to find the cover that contains the query. However hlCover has to go through words that do not match the query. Cover, on the other hand, operates on position indexes for just the query words and so it should be faster. The main reason why I would like it to be fast is that I want to generate all covers for a given query. Then choose covers with the smallest length as they will be the ones that best explain the relation of a query to a document. Finally stretch those covers to the specified size. In my understanding, the current headline generation tries to find the biggest cover for display in the headline. I personally think that such a cover does not explain the context of a query in a document. We may differ on this and that's why we may need both options. Let me know what you think on this patch and I will update the patch to respect other options like MinWords and ShortWord. NumFragments > 2: I wanted people to use the new headline marker if they specify NumFragments >= 1. If they do not specify NumFragments or set it to 0 then the default marker will be used. This becomes a bit of a tricky parameter so please put in any idea on how to trigger the new marker. On another note, I found that make_tsvector crashes if it receives a ParsedText with curwords = 0. Specifically, uniqueWORD returns curwords as 1 even when it gets 0 words. I am not sure if this is the desired behavior. -Sushant.
On Mon, 2008-06-02 at 18:10 +0400, Teodor Sigaev wrote: I have attached a new patch with respect to the current cvs head. This produces headline in a document for a given query. Basically it identifies fragments of text that contain the query and displays them. New variant is much better, but... HeadlineParsedText contains an array of actual words but not information about the norms. We need an indexed position vector for each norm so that we can quickly evaluate a number of possible fragments. Something that tsvector provides. Why do you need to store norms? The single purpose of norms is identifying words from the query - but it's already done by hlfinditem. It sets HeadlineWordEntry->item to the corresponding QueryOperand in the tsquery. Look, the headline function is rather expensive and your patch adds a lot of extra work - at least in memory usage. And if the user calls with NumFragments=0 then that work is unneeded. This approach does not change any other interface and fits nicely with the overall framework. Yeah, it's a really big step forward. Thank you. You are very close to committing except: Did you find a hlCover() function which produces a cover from the original HeadlineParsedText representation? Is there any reason not to use it? The norms are converted into tsvector and a number of covers are generated. The best covers are then chosen to be in the headline. The covers are separated using a hardcoded coversep. Let me know if you want to expose this as an option. Covers that overlap with already chosen covers are excluded. Some options like ShortWord and MinWords are not taken care of right now. MaxWords are used as maxcoversize. Let me know if you would like to see other options for fragment generation as well. ShortWord, MinWords and MaxWords should keep their meaning, but for each fragment, not for the whole headline. Let me know any more changes you would like to see.
if (num_fragments == 0)
    /* call the default headline generator */
    mark_hl_words(prs, query, highlight, shortword, min_words, max_words);
else
    mark_hl_fragments(prs, query, highlight, num_fragments, max_words);

Suppose num_fragments > 2?

Index: src/backend/tsearch/dict.c
===
RCS file: /home/sushant/devel/pgsql-cvs/pgsql/src/backend/tsearch/dict.c,v
retrieving revision 1.5
diff -u -r1.5 dict.c
--- src/backend/tsearch/dict.c	25 Mar 2008 22:42:43 -0000	1.5
+++ src/backend/tsearch/dict.c	30 May 2008 23:20:57 -0000
@@ -16,6 +16,7 @@
 #include "catalog/pg_type.h"
 #include "tsearch/ts_cache.h"
 #include "tsearch/ts_utils.h"
+#include "tsearch/ts_public.h"
 #include "utils/builtins.h"
Index: src/backend/tsearch/to_tsany.c
===
RCS file: /home/sushant/devel/pgsql-cvs/pgsql/src/backend/tsearch/to_tsany.c,v
retrieving revision 1.12
Re: [HACKERS] [GENERAL] Fragments in tsearch2 headline
My main argument for using Cover instead of hlCover was that Cover will be faster. I tested the default headline generation that uses hlCover with the current patch that uses Cover. There was not much difference. So I think you are right in that we do not need norms and we can just use hlCover. I also compared performance of ts_headline with my first patch to headline generation (one which was a separate function and took tsvector as input). The performance was dramatically different. For one query ts_headline took roughly 200 ms while headline_with_fragments took just 70 ms. On another query ts_headline took 76 ms while headline_with_fragments took 24 ms. You can find 'explain analyze' for the first query at the bottom of the page. These queries were run multiple times to ensure that I never hit the disk. This is a machine with a 2.0 GHz Pentium 4 CPU and 512 MB RAM running Linux 2.6.22-gentoo-r8. A couple of caveats: 1. ts_headline testing was done with the current cvs head whereas headline_with_fragments was done with postgres 8.3.1. 2. For headline_with_fragments, the TSVector for the document was obtained by joining with another table. Are these differences understandable? If you think these caveats are the reasons or there is something I am missing, then I can repeat the entire experiments with exactly the same conditions. -Sushant.
Here is 'explain analyze' for both the functions:

ts_headline
-----------
lawdb=# explain analyze SELECT ts_headline('english', doc, q, '')
        FROM docraw, plainto_tsquery('english', 'freedom of speech') as q
        WHERE docraw.tid = 125596;

 Nested Loop  (cost=0.00..8.31 rows=1 width=497) (actual time=199.692..200.207 rows=1 loops=1)
   ->  Index Scan using docraw_pkey on docraw  (cost=0.00..8.29 rows=1 width=465) (actual time=0.041..0.065 rows=1 loops=1)
         Index Cond: (tid = 125596)
   ->  Function Scan on q  (cost=0.00..0.01 rows=1 width=32) (actual time=0.010..0.014 rows=1 loops=1)
 Total runtime: 200.311 ms

headline_with_fragments
-----------------------
lawdb=# explain analyze SELECT headline_with_fragments('english', docvector, doc, q, 'MaxWords=40')
        FROM docraw, docmeta, plainto_tsquery('english', 'freedom of speech') as q
        WHERE docraw.tid = 125596 and docmeta.tid=125596;

 Nested Loop  (cost=0.00..16.61 rows=1 width=883) (actual time=70.564..70.949 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..16.59 rows=1 width=851) (actual time=0.064..0.094 rows=1 loops=1)
         ->  Index Scan using docraw_pkey on docraw  (cost=0.00..8.29 rows=1 width=454) (actual time=0.040..0.044 rows=1 loops=1)
               Index Cond: (tid = 125596)
         ->  Index Scan using docmeta_pkey on docmeta  (cost=0.00..8.29 rows=1 width=397) (actual time=0.017..0.040 rows=1 loops=1)
               Index Cond: (docmeta.tid = 125596)
   ->  Function Scan on q  (cost=0.00..0.01 rows=1 width=32) (actual time=0.012..0.016 rows=1 loops=1)
 Total runtime: 71.076 ms
(8 rows)

On Tue, 2008-06-03 at 22:53 +0400, Teodor Sigaev wrote: Why we need norms? We don't need norms at all - all matched HeadlineWordEntry are already marked by HeadlineWordEntry->item! If it equals NULL then this word isn't contained in the tsquery. hlCover does the exact thing that Cover in tsrank does which is to find the cover that contains the query. However hlCover has to go through words that do not match the query. Cover on the other hand operates on position indexes for just the query words and so it should be faster.
Cover, by definition, is a minimal continuous piece of text matched by the query. There may be several covers in a text and hlCover will find all of them. Next, prsd_headline() (for now) tries to define the best one. Best means: the cover contains a lot of words from the query, not less than MinWords, not greater than MaxWords, has no words shorter than ShortWord at the beginning and end of the cover, etc. The main reason why I would like it to be fast is that I want to generate all covers for a given query. Then choose covers with smallest hlCover generates all covers. Let me know what you think on this patch and I will update the patch to respect other options like MinWords and ShortWord. As I understand, you really wish to call the Cover() function instead of hlCover() - by design they should be identical, but they accept different document representations. So, the best way is to generalize them: develop a new one which can be called with some kind of callback or/and opaque structure to use it in both rank and headline. NumFragments > 2: I wanted people to use the new headline marker if they specify NumFragments >= 1. If they do not
Re: [HACKERS] phrase search
On Tue, 2008-06-03 at 22:16 +0400, Teodor Sigaev wrote: This is far more complicated than I thought. Of course, phrase search should be able to use indexes. I can probably look into how to use an index. Any pointers on this? src/backend/utils/adt/tsginidx.c - if you invent an operation # in tsquery then you will have index support with minimal effort. Yes this is exactly how I am using it in my application. Do you think this will solve a lot of common cases or should we try to get phrase search Yeah, it solves a lot of useful cases; for simple use it's needed to invent a function similar to the existing plainto_tsquery, say phraseto_tsquery. It should produce a correct tsquery with the operations described above. I can add index support and support for arbitrary distance between lexemes. It appears to me that supporting arbitrary boolean expressions will be complicated. Can we pull out something from TSQuery? -Sushant.
Re: [HACKERS] [GENERAL] Fragments in tsearch2 headline
I have attached an updated patch with the following changes: 1. Respects ShortWord and MinWords 2. Uses hlCover instead of Cover 3. Does not store norm (or lexeme) for headline marking 4. Removes ts_rank.h 5. Earlier it was counting even NONWORDTOKEN in the headline. Now it only counts the actual words and excludes spaces etc. I have also changed the NumFragments option to MaxFragments as there may not be enough covers to display NumFragments. Another change that I was thinking: right now if cover size > max_words then I just cut the trailing words. Instead I was thinking that we should split the cover into more fragments such that each fragment contains a few query words. Then each fragment will not contain all query words but will show more occurrences of query words in the headline. I would like to know what your opinion on this is. -Sushant. On Thu, 2008-06-05 at 20:21 +0400, Teodor Sigaev wrote: A couple of caveats: 1. ts_headline testing was done with current cvs head whereas headline_with_fragments was done with postgres 8.3.1. 2. For headline_with_fragments, TSVector for the document was obtained by joining with another table. Are these differences understandable? That is a possible situation because ts_headline has several criteria for 'best' covers - length, number of words from the query, good words at the beginning and at the end of the headline - while your fragment's algorithm takes care only of the total number of words in all covers. It's not very good, but it's acceptable, I think. Headline (and ranking too) has no formal rules to define whether it is good or bad - just people's opinions. Next possible reason: the original algorithm had a look at all covers trying to find the best one while your algorithm tries to find just the shortest covers to fill a headline. But it's very desirable to use ShortWord - it's not very comfortable for the user if one option produces an unobvious side effect with another one.
If you think these caveats are the reasons or there is something I am missing, then I can repeat the entire experiments with exactly the same conditions. The interesting test for me is comparing hlCover with Cover in your patch, i.e. develop a patch which uses hlCover instead of Cover and compare the old patch with the new one.

Index: src/backend/tsearch/wparser_def.c
===
RCS file: /home/sushant/devel/pgsql-cvs/pgsql/src/backend/tsearch/wparser_def.c,v
retrieving revision 1.14
diff -c -r1.14 wparser_def.c
*** src/backend/tsearch/wparser_def.c	1 Jan 2008 19:45:52 -0000	1.14
--- src/backend/tsearch/wparser_def.c	21 Jun 2008 07:59:02 -0000
***************
*** 1684,1701 ****
  	return false;
  }

! Datum
! prsd_headline(PG_FUNCTION_ARGS)
  {
! 	HeadlineParsedText *prs = (HeadlineParsedText *) PG_GETARG_POINTER(0);
! 	List	   *prsoptions = (List *) PG_GETARG_POINTER(1);
! 	TSQuery		query = PG_GETARG_TSQUERY(2);
! 	/* from opt + start and end tag */
! 	int			min_words = 15;
! 	int			max_words = 35;
! 	int			shortword = 3;
  	int			p = 0,
  				q = 0;
  	int			bestb = -1,
--- 1684,1891 ----
  	return false;
  }

! static void
! mark_fragment(HeadlineParsedText *prs, int highlight, int startpos, int endpos)
  {
! 	int			i;
! 	char	   *coversep = ...;
! 	int			coverlen = strlen(coversep);

! 	for (i = startpos; i <= endpos; i++)
! 	{
! 		if (prs->words[i].item)
! 			prs->words[i].selected = 1;
! 		if (highlight == 0)
! 		{
! 			if (HLIDIGNORE(prs->words[i].type))
! 				prs->words[i].replace = 1;
! 		}
! 		else
! 		{
! 			if (XMLHLIDIGNORE(prs->words[i].type))
! 				prs->words[i].replace = 1;
! 		}
!
! 		prs->words[i].in = (prs->words[i].repeated) ? 0 : 1;
! 	}
! 	/* add cover separators if needed */
! 	if (startpos > 0 && strncmp(prs->words[startpos - 1].word, coversep,
! 								prs->words[startpos - 1].len) != 0)
! 	{
! 		prs->words[startpos - 1].word = repalloc(prs->words[startpos - 1].word, sizeof(char) * coverlen);
! 		prs->words[startpos - 1].in = 1;
! 		prs->words[startpos - 1].len = coverlen;
! 		memcpy(prs->words[startpos - 1].word, coversep, coverlen);
! 	}
! 	if (endpos - 1 < prs->curwords && strncmp(prs->words[startpos - 1].word, coversep,
! 								prs->words[startpos - 1].len) != 0)
! 	{
! 		prs->words[endpos + 1].word = repalloc(prs->words[endpos + 1].word, sizeof(char) * coverlen);
! 		prs->words[endpos + 1].in = 1;
! 		memcpy(prs->words[endpos + 1].word, coversep, coverlen);
! 	}
! }
!
! typedef struct
! {
! 	int4		startpos;
! 	int4		endpos;
! 	int2		in;
! 	int2		excluded;
! } CoverPos;
!
! static void
! mark_hl_fragments(HeadlineParsedText *prs, TSQuery query, int highlight,
! 				  int shortword, int min_words,
! 				  int max_words, int max_fragments)
! {
! 	int4		curlen, coverlen, i, f, num_f;
! 	int4		stretch, maxstretch;
!
! 	int4		startpos = 0,
! 				endpos = 0,
! 				p = 0,
! 				q = 0;
!
! 	int4		numcovers = 0,
! 				maxcovers = 32;
!
! 	int4
[HACKERS] initdb in current cvs head broken?
I am trying to generate a patch against the current CVS head. So I rsynced the tree, did cvs up, and installed the db. However, when I ran initdb on a data directory it got stuck after printing: creating template1 database in /home/postgres/data/base/1 ... I did strace: $ strace -p 9852 Process 9852 attached - interrupt to quit waitpid(9864, then I straced 9864: $ strace -p 9864 Process 9864 attached - interrupt to quit semop(8060958, 0xbff36fee, $ ps aux|grep 9864 postgres 9864 1.5 1.3 37296 6816 pts/1 S+ 07:51 0:02 /usr/local/pgsql/bin/postgres --boot -x1 -F Seems like a bug to me. Is the tree stable only after commit fests, and should I not use the unstable tree for generating patches? Thanks, -Sushant. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] initdb in current cvs head broken?
You are right. I did not do make clean last time. After make clean, make all, and make install it works fine. -Sushant. On Thu, 2008-07-10 at 17:55 +0530, Pavan Deolasee wrote: On Thu, Jul 10, 2008 at 5:36 PM, Sushant Sinha [EMAIL PROTECTED] wrote: Seems like a bug to me. Is the tree stable only after commit fests and I should not use the unstable tree for generating patches? I quickly tried on my repo and it's working fine. (Well, it could be a bit out of sync with the head.) Usually, the tree may get a bit inconsistent during the active period, but it's not very common. I've seen committers doing a good job before checking in any code and making sure it works fine (at least initdb and regression tests). I would suggest doing a clean build at your end once again. Thanks, Pavan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [GENERAL] Fragments in tsearch2 headline
Attached is a new patch that: 1. fixes the previous bug 2. better handles the case when the cover size is greater than MaxWords. Basically it divides a cover greater than MaxWords into fragments of MaxWords, resizes each such fragment so that each end of the fragment contains a query word, and then evaluates the best fragments based on the number of query words in each fragment. In case of a tie it picks the smaller fragment. This allows more query words to be shown with multiple fragments in case a single cover is larger than MaxWords. The resizing of a fragment such that each end is a query word provides room for stretching both sides of the fragment. This (hopefully) better presents the context in which query words appear in the document. If a cover is smaller than MaxWords then the cover is treated as a fragment. Let me know if you have any more suggestions or anything is not clear. I have not yet added the regression tests. The regression test suite seemed to be only ensuring that the function works. How many tests should I be adding? Is there any other place that I need to add different test cases for the function? -Sushant. Nice. But it will be good to resolve the following issues: 1) The patch contains mistakes; I didn't investigate or carefully read it. Get http://www.sai.msu.su/~megera/postgres/fts/apod.dump.gz and load it in a db. The queries # select ts_headline(body, plainto_tsquery('black hole'), 'MaxFragments=1') from apod where to_tsvector(body) @@ plainto_tsquery('black hole'); and # select ts_headline(body, plainto_tsquery('black hole'), 'MaxFragments=1') from apod; crash postgresql :( 2) pls, include in your patch documentation and regression tests. Another change that I was thinking: Right now if cover size > max_words then I just cut the trailing words. Instead I was thinking that we should split the cover into more fragments such that each fragment contains a few query words.
Then each fragment will not contain all query words but will show more occurrences of query words in the headline. I would like to know what your opinion on this is. Agreed. -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/

Index: src/backend/tsearch/wparser_def.c
===================================================================
RCS file: /home/postgres/devel/pgsql-cvs/pgsql/src/backend/tsearch/wparser_def.c,v
retrieving revision 1.15
diff -c -r1.15 wparser_def.c
*** src/backend/tsearch/wparser_def.c	17 Jun 2008 16:09:06 -0000	1.15
--- src/backend/tsearch/wparser_def.c	15 Jul 2008 04:30:34 -0000
***************
*** 1684,1701 ****
  	return false;
  }
  
! Datum
! prsd_headline(PG_FUNCTION_ARGS)
  {
! 	HeadlineParsedText *prs = (HeadlineParsedText *) PG_GETARG_POINTER(0);
! 	List	   *prsoptions = (List *) PG_GETARG_POINTER(1);
! 	TSQuery		query = PG_GETARG_TSQUERY(2);
! 	/* from opt + start and end tag */
! 	int			min_words = 15;
! 	int			max_words = 35;
! 	int			shortword = 3;
  	int			p = 0,
  				q = 0;
  	int			bestb = -1,
--- 1684,1944 ----
  	return false;
  }
  
! static void
! mark_fragment(HeadlineParsedText *prs, int highlight, int startpos, int endpos)
  {
! 	int			i;
! 	char	   *coversep = ...;
! 	int			seplen = strlen(coversep);
! 
! 	for (i = startpos; i <= endpos; i++)
! 	{
! 		if (prs->words[i].item)
! 			prs->words[i].selected = 1;
! 		if (highlight == 0)
! 		{
! 			if (HLIDIGNORE(prs->words[i].type))
! 				prs->words[i].replace = 1;
! 		}
! 		else
! 		{
! 			if (XMLHLIDIGNORE(prs->words[i].type))
! 				prs->words[i].replace = 1;
! 		}
! 
! 		prs->words[i].in = (prs->words[i].repeated) ? 0 : 1;
! 	}
! 	/* add cover separators if needed */
! 	if (startpos > 0)
! 	{
! 		prs->words[startpos-1].word = repalloc(prs->words[startpos-1].word, sizeof(char) * seplen);
! 		prs->words[startpos-1].in = 1;
! 		prs->words[startpos-1].len = seplen;
! 		memcpy(prs->words[startpos-1].word, coversep, seplen);
! 	}
! }
! 
! typedef struct
! {
! 	int4		startpos;
! 	int4		endpos;
! 	int4		poslen;
! 	int4		curlen;
! 	int2		in;
! 	int2		excluded;
! } CoverPos;
! 
! static void
! get_next_fragment(HeadlineParsedText *prs, int *startpos, int *endpos,
! 				  int *curlen, int *poslen, int max_words)
! {
! 	int			i;
! 
! 	/* Objective: Generate a fragment of words between startpos and endpos
! 	 * such that it has at most max_words and both ends have query words.
! 	 * If the startpos and endpos are the endpoints of the cover and the
! 	 * cover has fewer words than max_words, then this function should
! 	 * just return the cover
! 	 */
! 	/* first move startpos to an item */
! 	for (i = *startpos; i <= *endpos; i++)
! 	{
! 		*startpos = i;
! 		if (prs->words[i].item && !prs->words[i].repeated)
! 			break;
! 	}
! 	/* cut endpos to have only max_words */
! 	*curlen = 0;
! 	*poslen = 0;
! 	for (i = *startpos; i <= *endpos && *curlen < max_words; i++)
! 	{
!
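The cover-splitting strategy discussed in this thread (divide an oversized cover into MaxWords-sized pieces, trim each piece so both ends are query words, then rank pieces by query-word count with ties going to the shorter piece) can be sketched compactly. This is a rough Python illustration of the idea only, not the patch's C implementation; all names are made up.

```python
# Illustrative sketch of the fragment-splitting idea, not the patch's C code.

def split_cover(is_query_word, max_words):
    """is_query_word: one boolean per word in the cover.
    Returns (start, end, num_query_words) fragments of at most
    max_words words, trimmed so both ends are query words, sorted
    best-first (more query words; ties broken by shorter length)."""
    fragments = []
    n = len(is_query_word)
    start = 0
    while start < n:
        end = min(start + max_words, n) - 1
        # shrink both ends so the fragment starts and ends on a query word
        while start <= end and not is_query_word[start]:
            start += 1
        while end >= start and not is_query_word[end]:
            end -= 1
        if start > end:
            break
        score = sum(1 for i in range(start, end + 1) if is_query_word[i])
        fragments.append((start, end, score))
        start = end + 1
    fragments.sort(key=lambda f: (-f[2], f[1] - f[0]))
    return fragments
```

A cover no larger than max_words comes back as a single fragment, matching the behavior described in the mail.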
Re: [HACKERS] [GENERAL] Fragments in tsearch2 headline
Attached are two patches: 1. documentation 2. regression tests for headline with fragments. -Sushant. On Tue, 2008-07-15 at 13:29 +0400, Teodor Sigaev wrote: Attached a new patch that: 1. fixes previous bug 2. better handles the case when cover size is greater than the MaxWords. Looks good, I'll make some tests with a real-world application. I have not yet added the regression tests. The regression test suite seemed to be only ensuring that the function works. How many tests should I be adding? Is there any other place that I need to add different test cases for the function? Just add 3-5 selects to src/test/regress/sql/tsearch.sql checking basic functionality and corner cases like - there are no covers in the text - cover(s) is too big - and so on. Add some words to the documentation too, pls.

Index: doc/src/sgml/textsearch.sgml
===================================================================
RCS file: /home/postgres/devel/pgsql-cvs/pgsql/doc/src/sgml/textsearch.sgml,v
retrieving revision 1.44
diff -c -r1.44 textsearch.sgml
*** doc/src/sgml/textsearch.sgml	16 May 2008 16:31:01 -0000	1.44
--- doc/src/sgml/textsearch.sgml	16 Jul 2008 02:37:28 -0000
***************
*** 1100,1105 ****
--- 1100,1117 ----
      </listitem>
      <listitem>
       <para>
+       <literal>MaxFragments</literal>: maximum number of text excerpts
+       or fragments that match the query words. It also triggers a
+       different headline generation function than the default one. This
+       function finds text fragments with as many query words as possible.
+       Each fragment will be of at most MaxWords and will not have words
+       of size less than or equal to ShortWord at the start or end of a
+       fragment. If all query words are not found in the document, then
+       a single fragment of MinWords will be displayed.
+      </para>
+     </listitem>
+     <listitem>
+      <para>
        <literal>HighlightAll</literal>: Boolean flag; if
        <literal>true</literal> the whole document will be highlighted.
       </para>
***************
*** 1109,1115 ****
     Any unspecified options receive these defaults:
  
  <programlisting>
! StartSel=&lt;b&gt;, StopSel=&lt;/b&gt;, MaxWords=35, MinWords=15, ShortWord=3, HighlightAll=FALSE
  </programlisting>
     </para>
--- 1121,1127 ----
     Any unspecified options receive these defaults:
  
  <programlisting>
! StartSel=&lt;b&gt;, StopSel=&lt;/b&gt;, MaxFragments=0, MaxWords=35, MinWords=15, ShortWord=3, HighlightAll=FALSE
  </programlisting>
     </para>

Index: src/test/regress/sql/tsearch.sql
===================================================================
RCS file: /home/postgres/devel/pgsql-cvs/pgsql/src/test/regress/sql/tsearch.sql,v
retrieving revision 1.9
diff -c -r1.9 tsearch.sql
*** src/test/regress/sql/tsearch.sql	16 May 2008 16:31:02 -0000	1.9
--- src/test/regress/sql/tsearch.sql	16 Jul 2008 03:45:24 -0000
***************
*** 208,213 ****
--- 208,253 ----
  </html>', to_tsquery('english', 'seafoo'), 'HighlightAll=true');
  
+ --Check if headline fragments work
+ SELECT ts_headline('english', '
+ Day after day, day after day,
+   We stuck, nor breath nor motion,
+ As idle as a painted Ship
+   Upon a painted Ocean.
+ Water, water, every where
+   And all the boards did shrink;
+ Water, water, every where,
+   Nor any drop to drink.
+ S. T. Coleridge (1772-1834)
+ ', to_tsquery('english', 'ocean'), 'MaxFragments=1');
+ 
+ --Check if more than one fragments are displayed
+ SELECT ts_headline('english', '
+ Day after day, day after day,
+   We stuck, nor breath nor motion,
+ As idle as a painted Ship
+   Upon a painted Ocean.
+ Water, water, every where
+   And all the boards did shrink;
+ Water, water, every where,
+   Nor any drop to drink.
+ S. T. Coleridge (1772-1834)
+ ', to_tsquery('english', 'Coleridge & stuck'), 'MaxFragments=2');
+ 
+ --Fragments when all query words are not in the document
+ SELECT ts_headline('english', '
+ Day after day, day after day,
+   We stuck, nor breath nor motion,
+ As idle as a painted Ship
+   Upon a painted Ocean.
+ Water, water, every where
+   And all the boards did shrink;
+ Water, water, every where,
+   Nor any drop to drink.
+ S. T. Coleridge (1772-1834)
+ ', to_tsquery('english', 'ocean & seahorse'), 'MaxFragments=1');
+ 
+ 
  --Rewrite sub system
  CREATE TABLE test_tsquery (txtkeyword TEXT, txtsample TEXT);

Index: src/test/regress/expected/tsearch.out
===================================================================
RCS file: /home/postgres/devel/pgsql-cvs/pgsql/src/test/regress/expected/tsearch.out,v
retrieving revision 1.14
diff -c -r1.14 tsearch.out
*** src/test/regress/expected/tsearch.out	16 May 2008 16:31:02 -0000	1.14
--- src/test/regress/expected/tsearch.out	16 Jul 2008 03:47:46 -0000
***************
*** 632,637 ****
--- 632,705 ----
  </html>
  (1 row)
  
+ --Check if headline fragments work
+ SELECT ts_headline('english', '
+ Day after day, day after day,
+   We
Re: [HACKERS] [GENERAL] Fragments in tsearch2 headline
I will add test queries and their results for the corner cases in a separate file. I guess the only thing I am confused about is what the behavior of headline generation should be when query items have words of size less than ShortWord. I guess the answer is to ignore the ShortWord parameter, but let me know if the answer is any different. -Sushant. On Thu, 2008-07-17 at 02:53 +0400, Oleg Bartunov wrote: Sushant, first, please provide simple test queries which demonstrate the right behavior in the corner cases. This will help reviewers test your patch and help you make sure your new version is ok. For example: =# select ts_headline('1 2 3 4 5 1 2 3 1','1&3'::tsquery); ts_headline -- <b>1</b> 2 <b>3</b> 4 5 <b>1</b> 2 <b>3</b> <b>1</b> This select breaks your code: =# select ts_headline('1 2 3 4 5 1 2 3 1','1&3'::tsquery,'maxfragments=2'); ts_headline -- ... 2 ... and so on Oleg On Tue, 15 Jul 2008, Sushant Sinha wrote: Attached a new patch that: 1. fixes previous bug 2. better handles the case when cover size is greater than the MaxWords. Basically it divides a cover greater than MaxWords into fragments of MaxWords, resizes each such fragment so that each end of the fragment contains a query word, and then evaluates the best fragments based on the number of query words in each fragment. In case of a tie it picks the smaller fragment. This allows more query words to be shown with multiple fragments in case a single cover is larger than the MaxWords. The resizing of a fragment such that each end is a query word provides room for stretching both sides of the fragment. This (hopefully) better presents the context in which query words appear in the document. If a cover is smaller than MaxWords then the cover is treated as a fragment. Let me know if you have any more suggestions or anything is not clear. I have not yet added the regression tests. The regression test suite seemed to be only ensuring that the function works. How many tests should I be adding? Is there any other place that I need to add different test cases for the function? -Sushant. Nice. But it will be good to resolve the following issues: 1) The patch contains mistakes; I didn't investigate or carefully read it. Get http://www.sai.msu.su/~megera/postgres/fts/apod.dump.gz and load it in a db. The queries # select ts_headline(body, plainto_tsquery('black hole'), 'MaxFragments=1') from apod where to_tsvector(body) @@ plainto_tsquery('black hole'); and # select ts_headline(body, plainto_tsquery('black hole'), 'MaxFragments=1') from apod; crash postgresql :( 2) pls, include in your patch documentation and regression tests. Another change that I was thinking: Right now if cover size > max_words then I just cut the trailing words. Instead I was thinking that we should split the cover into more fragments such that each fragment contains a few query words. Then each fragment will not contain all query words but will show more occurrences of query words in the headline. I would like to know what your opinion on this is. Agreed. -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/ Regards, Oleg _ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] small bug in hlCover
I think there is a slight bug in the hlCover function in wparser_def.c. If there is only one query item and that is the first word in the text, then hlCover does not return any cover. This is evident in this example, where ts_headline only generates min_words:

testdb=# select ts_headline('1 2 3 4 5 6 7 8 9 10','1'::tsquery, 'MinWords=5');
 ts_headline
--------------
 <b>1</b> 2 3 4 5
(1 row)

The problem is that *q is initialized to 0, which is a legitimate value for a cover. So I have attached a patch that fixes it, and after applying the patch here is the result:

testdb=# select ts_headline('1 2 3 4 5 6 7 8 9 10','1'::tsquery, 'MinWords=5');
 ts_headline
---------------------------
 <b>1</b> 2 3 4 5 6 7 8 9 10
(1 row)

-Sushant.

Index: src/backend/tsearch/wparser_def.c
===================================================================
RCS file: /home/postgres/devel/pgsql-cvs/pgsql/src/backend/tsearch/wparser_def.c,v
retrieving revision 1.15
diff -c -r1.15 wparser_def.c
*** src/backend/tsearch/wparser_def.c	17 Jun 2008 16:09:06 -0000	1.15
--- src/backend/tsearch/wparser_def.c	17 Jul 2008 02:45:34 -0000
***************
*** 1621,1627 ****
  	QueryItem  *item = GETQUERY(query);
  	int			pos = *p;
  
! 	*q = 0;
  	*p = 0x7fff;
  
  	for (j = 0; j < query->size; j++)
--- 1621,1627 ----
  	QueryItem  *item = GETQUERY(query);
  	int			pos = *p;
  
! 	*q = -1;
  	*p = 0x7fff;
  
  	for (j = 0; j < query->size; j++)
***************
*** 1643,1649 ****
  		item++;
  	}
  
! 	if (*q == 0)
  		return false;
  
  	item = GETQUERY(query);
--- 1643,1649 ----
  		item++;
  	}
  
! 	if (*q < 0)
  		return false;
  
  	item = GETQUERY(query);

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
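The underlying mistake is a classic in-band sentinel: 0 marks "no cover found", but 0 is also a valid word position, so a query word at the very start of the text is indistinguishable from no match. A tiny Python illustration of the same pattern (not the actual C code; names are invented):

```python
# Using 0 as the "not found" marker fails when 0 is itself a valid position.

def max_pos_buggy(positions):
    """Return the largest matching position, or None.
    Buggy: cannot tell 'no match' from 'match at position 0'."""
    q = 0                 # sentinel clashes with a real position
    for pos in positions:
        if pos > q:
            q = pos
    if q == 0:
        return None       # wrongly rejects a match at position 0
    return q

def max_pos_fixed(positions):
    """Same search with an out-of-band sentinel, mirroring the patch."""
    q = -1                # -1 can never be a real word position
    for pos in positions:
        if pos > q:
            q = pos
    if q < 0:
        return None
    return q
```

The fix in the patch is exactly the second variant: initialize to -1 and test `*q < 0` instead of `*q == 0`.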
Re: [HACKERS] [GENERAL] Fragments in tsearch2 headline
Fixed some off-by-one errors pointed out by Oleg and errors in excluding overlapping fragments. Also added test queries and updated the regression tests. Let me know of any other changes that are needed. -Sushant. On Thu, 2008-07-17 at 03:28 +0400, Oleg Bartunov wrote: On Wed, 16 Jul 2008, Sushant Sinha wrote: I will add test queries and their results for the corner cases in a separate file. I guess the only thing I am confused about is what the behavior of headline generation should be when query items have words of size less than ShortWord. I guess the answer is to ignore the ShortWord parameter, but let me know if the answer is any different. ShortWord is about the headline text; it doesn't affect words in the query, so you can't discard them from the query. -Sushant. On Thu, 2008-07-17 at 02:53 +0400, Oleg Bartunov wrote: Sushant, first, please provide simple test queries which demonstrate the right behavior in the corner cases. This will help reviewers test your patch and help you make sure your new version is ok. For example: =# select ts_headline('1 2 3 4 5 1 2 3 1','1&3'::tsquery); ts_headline -- <b>1</b> 2 <b>3</b> 4 5 <b>1</b> 2 <b>3</b> <b>1</b> This select breaks your code: =# select ts_headline('1 2 3 4 5 1 2 3 1','1&3'::tsquery,'maxfragments=2'); ts_headline -- ... 2 ... and so on Oleg On Tue, 15 Jul 2008, Sushant Sinha wrote: Attached a new patch that: 1. fixes previous bug 2. better handles the case when cover size is greater than the MaxWords. Basically it divides a cover greater than MaxWords into fragments of MaxWords, resizes each such fragment so that each end of the fragment contains a query word, and then evaluates the best fragments based on the number of query words in each fragment. In case of a tie it picks the smaller fragment. This allows more query words to be shown with multiple fragments in case a single cover is larger than the MaxWords. The resizing of a fragment such that each end is a query word provides room for stretching both sides of the fragment. This (hopefully) better presents the context in which query words appear in the document. If a cover is smaller than MaxWords then the cover is treated as a fragment. Let me know if you have any more suggestions or anything is not clear. I have not yet added the regression tests. The regression test suite seemed to be only ensuring that the function works. How many tests should I be adding? Is there any other place that I need to add different test cases for the function? -Sushant. Nice. But it will be good to resolve the following issues: 1) The patch contains mistakes; I didn't investigate or carefully read it. Get http://www.sai.msu.su/~megera/postgres/fts/apod.dump.gz and load it in a db. The queries # select ts_headline(body, plainto_tsquery('black hole'), 'MaxFragments=1') from apod where to_tsvector(body) @@ plainto_tsquery('black hole'); and # select ts_headline(body, plainto_tsquery('black hole'), 'MaxFragments=1') from apod; crash postgresql :( 2) pls, include in your patch documentation and regression tests. Another change that I was thinking: Right now if cover size > max_words then I just cut the trailing words. Instead I was thinking that we should split the cover into more fragments such that each fragment contains a few query words. Then each fragment will not contain all query words but will show more occurrences of query words in the headline. I would like to know what your opinion on this is. Agreed.
-- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/ Regards, Oleg _ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83 Regards, Oleg _ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83 Index: src/backend/tsearch/wparser_def.c === RCS file: /home/postgres/devel/pgsql-cvs/pgsql/src/backend/tsearch/wparser_def.c,v retrieving revision 1.15 diff -c -r1.15 wparser_def.c *** src/backend/tsearch/wparser_def.c 17 Jun 2008 16:09:06 - 1.15 --- src/backend/tsearch/wparser_def.c 18 Jul 2008
Re: [HACKERS] phrase search
I looked at query operators for tsquery, and here are some new query operators for position-based queries. I am just proposing some changes along with the questions I have. 1. What is the meaning of such a query operator? foo #5 bar - true if the document has the word foo followed by bar at the 5th position foo #<5 bar - true if the document has the word foo followed by bar within 5 positions foo #>5 bar - true if the document has the word foo followed by bar after 5 positions Then some other ways it can be used are: !(foo #<5 bar) - true if the document never has any foo followed by bar within 5 positions, etc. 2. How to implement such query operators? Should we modify QueryItem to include additional distance information, or is there another way to accomplish it? Is the following list sufficient to accomplish this? a. Modify to_tsquery b. Modify TS_execute in tsvector_op.c to check the new operator Is anything needed in the rewrite subsystem? 3. Are these valid uses of the operators, and if yes, what would they mean? foo #<5 (bar & cup) If not, should the operator be applied only to two QI_VALs? 4. If the operator only applies to two query items, can we create an index such that (foo, bar) -> documents[min distance, max distance]? How difficult would it be to implement an index like this? Thanks, -Sushant. On Thu, 2008-06-05 at 19:37 +0400, Teodor Sigaev wrote: I can add index support and support for arbitrary distance between lexemes. It appears to me that supporting arbitrary boolean expressions will be complicated. Can we pull out something from TSQuery? I don't much like the idea of having a separate interface for phrase search. Your patch may be a module used by people who really want phrase search. Introducing a new operator in tsquery allows us to use the already existing infrastructure of tsquery such as concatenations (&&, ||, !!), the rewrite subsystem etc. But new operations/types specially designed for phrase search make it necessary to do that work again.
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
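For the semantics question in point 1 above, a distance-constrained FOLLOWED-BY could be evaluated off the position lists a tsvector already stores per lexeme. A hedged Python sketch of that evaluation - the function name and exact operator semantics are assumptions for illustration, not PostgreSQL code:

```python
# Sketch of evaluating "foo followed by bar within max_dist positions",
# given each lexeme's sorted position list as stored in a tsvector.
# Illustrative only; not PostgreSQL's TS_execute.

def followed_by_within(pos_a, pos_b, max_dist):
    """True if some occurrence of b strictly follows some occurrence
    of a by at most max_dist positions."""
    return any(0 < pb - pa <= max_dist for pa in pos_a for pb in pos_b)

# For the document 'foo bar baz foo qux bar':
#   foo -> positions [1, 4], bar -> positions [2, 6]
```

The O(len(a) * len(b)) scan is fine for a sketch; a real implementation would merge the two sorted lists in linear time.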
Re: [HACKERS] [GENERAL] Fragments in tsearch2 headline
I guess it is more readable to add the cover separator at the end of a fragment than at the front. Let me know what you think and I can update it. I think the right place for the cover separator is in the structure HeadlineParsedText, just like startsel and stopsel. This will enable users to specify their own cover separators. But this will require changes to the structure as well as to the generateHeadline function. This option will also not play well with the default headline generation function. The default MaxWords = 35 seems a bit high for this headline generation function and 20 seems more reasonable. Any thoughts? -Sushant. On Wed, Jul 23, 2008 at 7:44 AM, Oleg Bartunov [EMAIL PROTECTED] wrote: btw, is it intentional to have '...' in the headline? =# select ts_headline('1 2 3 4 5 1 2 3 1','1&4'::tsquery,'MaxFragments=1'); ts_headline - ... <b>4</b> 5 <b>1</b> Oleg On Wed, 23 Jul 2008, Teodor Sigaev wrote: Let me know of any other changes that are needed. Looks like ready to commit, but documentation is needed. Regards, Oleg _ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
Re: [HACKERS] [GENERAL] Fragments in tsearch2 headline
Sorry for the delay. Here is the patch with the FragmentDelimiter option. It requires an extra option in HeadlineParsedText and uses that option during generateHeadline. Implementing a notion of fragments in HeadlineParsedText and a separate function to join them seems more complicated, so for the time being I just emit a FragmentDelimiter whenever a new fragment (other than the first one) starts. The patch also contains the updated regression tests/results, including a new test for the FragmentDelimiter option, as well as documentation for the new options. I have also attached a separate file that tests different aspects of the new headline generation function. Let me know if anything else is needed. -Sushant. On Thu, 2008-07-24 at 00:28 +0400, Oleg Bartunov wrote: On Wed, 23 Jul 2008, Sushant Sinha wrote: I guess it is more readable to add the cover separator at the end of a fragment than at the front. Let me know what you think and I can update it. FragmentsDelimiter should *separate* fragments, and that says it all. Not a very difficult algorithmic problem - it's like Perl's join(FragmentsDelimiter, @array). I think the right place for the cover separator is in the structure HeadlineParsedText, just like startsel and stopsel. This will enable users to specify their own cover separators. But this will require changes to the structure as well as to the generateHeadline function. This option will also not play well with the default headline generation function. As soon as we introduce FragmentsDelimiter we should make it configurable. The default MaxWords = 35 seems a bit high for this headline generation function and 20 seems more reasonable. Any thoughts? I think we should not change the default value because it could change the behaviour of existing applications. I'm not sure if it'd be useful and possible to define default values in CREATE TEXT SEARCH PARSER -Sushant.
On Wed, Jul 23, 2008 at 7:44 AM, Oleg Bartunov [EMAIL PROTECTED] wrote: btw, is it intentional to have '...' in the headline? =# select ts_headline('1 2 3 4 5 1 2 3 1','1&4'::tsquery,'MaxFragments=1'); ts_headline - ... <b>4</b> 5 <b>1</b> Oleg On Wed, 23 Jul 2008, Teodor Sigaev wrote: Let me know of any other changes that are needed. Looks like ready to commit, but documentation is needed. Regards, Oleg _ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83

Index: src/include/tsearch/ts_public.h
===================================================================
RCS file: /home/postgres/devel/pgsql-cvs/pgsql/src/include/tsearch/ts_public.h,v
retrieving revision 1.10
diff -c -r1.10 ts_public.h
*** src/include/tsearch/ts_public.h	18 Jun 2008 18:42:54 -0000	1.10
--- src/include/tsearch/ts_public.h	2 Aug 2008 02:40:27 -0000
***************
*** 52,59 ****
--- 52,61 ----
  	int4		curwords;
  	char	   *startsel;
  	char	   *stopsel;
+ 	char	   *fragdelim;
  	int2		startsellen;
  	int2		stopsellen;
+ 	int2		fragdelimlen;
  } HeadlineParsedText;

Index: src/backend/tsearch/wparser_def.c
===================================================================
RCS file: /home/postgres/devel/pgsql-cvs/pgsql/src/backend/tsearch/wparser_def.c,v
retrieving revision 1.15
diff -c -r1.15 wparser_def.c
*** src/backend/tsearch/wparser_def.c	17 Jun 2008 16:09:06 -0000	1.15
--- src/backend/tsearch/wparser_def.c	2 Aug 2008 15:25:46 -0000
***************
*** 1684,1701 ****
  	return false;
  }
  
! Datum
! prsd_headline(PG_FUNCTION_ARGS)
  {
! 	HeadlineParsedText *prs = (HeadlineParsedText *) PG_GETARG_POINTER(0);
! 	List	   *prsoptions = (List *) PG_GETARG_POINTER(1);
! 	TSQuery		query = PG_GETARG_TSQUERY(2);
! 	/* from opt + start and end tag */
! 	int			min_words = 15;
! 	int			max_words = 35;
! 	int			shortword = 3;
  	int			p = 0,
  				q = 0;
  	int			bestb = -1,
--- 1684,1930 ----
  	return false;
  }
  
! static void
! mark_fragment(HeadlineParsedText *prs, int highlight, int startpos, int endpos)
  {
! 	int			i;
! 
! 	for (i = startpos; i <= endpos; i++)
! 	{
! 		if (prs->words[i].item)
! 			prs->words[i].selected = 1;
! 		if (highlight == 0)
! 		{
! 			if (HLIDIGNORE(prs->words[i].type))
! 				prs->words[i].replace = 1;
! 		}
! 		else
! 		{
! 			if (XMLHLIDIGNORE(prs->words[i].type
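Oleg's join() analogy from earlier in this thread can be made concrete: once the fragments are generated, placing the delimiter is just a join over them. Illustrative Python only; the patch itself emits the delimiter while generating the headline rather than joining afterwards, and the fragment strings here are made up:

```python
# Joining already-generated headline fragments with a delimiter,
# per Oleg's perl join(FragmentsDelimiter, @array) analogy.
fragments = ["<b>1</b> 2 <b>3</b>", "<b>3</b> <b>1</b>"]
delimiter = " ... "
headline = delimiter.join(fragments)
```

The delimiter appears only *between* fragments, never before the first or after the last - which is the behavior Oleg asked for when he questioned the leading '...' in the output.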
Re: [HACKERS] small bug in hlCover
Has anyone noticed this? -Sushant. On Wed, 2008-07-16 at 23:01 -0400, Sushant Sinha wrote: I think there is a slight bug in the hlCover function in wparser_def.c. If there is only one query item and that is the first word in the text, then hlCover does not return any cover. This is evident in this example, where ts_headline only generates min_words: testdb=# select ts_headline('1 2 3 4 5 6 7 8 9 10','1'::tsquery, 'MinWords=5'); ts_headline -- <b>1</b> 2 3 4 5 (1 row) The problem is that *q is initialized to 0, which is a legitimate value for a cover. So I have attached a patch that fixes it, and after applying the patch here is the result: testdb=# select ts_headline('1 2 3 4 5 6 7 8 9 10','1'::tsquery, 'MinWords=5'); ts_headline - <b>1</b> 2 3 4 5 6 7 8 9 10 (1 row) -Sushant. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] small bug in hlCover
On Mon, 2008-08-04 at 00:36 -0300, Euler Taveira de Oliveira wrote: Sushant Sinha wrote: I think there is a slight bug in the hlCover function in wparser_def.c The bug is not in hlCover. In prsd_headline, if we didn't find a suitable bestlen (i.e. <= 0), then it includes up to the document length or *maxWords* (here is the bug). I'm attaching a small patch that fixes it and some comment typos. Please apply it to 8_3_STABLE too. Well, hlCover's purpose is to find a cover, and for the document '1 2 3 4 5 6 7 8 9 10' and the query '1'::tsquery, a cover exists. So it should point it out. In my source I see that prsd_headline marks only min_words, which seems like the right thing to do. -Sushant. euler=# select ts_headline('1 2 3 4 5 6 7 8 9 10','1'::tsquery, 'MinWords=5'); ts_headline - <b>1</b> 2 3 4 5 6 7 8 9 10 (1 registro) euler=# select ts_headline('1 2 3 4 5 6 7 8 9 10','1'::tsquery); ts_headline - <b>1</b> 2 3 4 5 6 7 8 9 10 (1 registro) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] english parser in text search: support for multiple words in the same position
Currently the english parser in text search does not support multiple words in the same position. Consider a word like wikipedia.org. The text search would return a single token wikipedia.org. However, if someone searches for wikipedia org then there will not be a match. There are two problems here: 1. We do not have separate tokens wikipedia and org 2. If we have the two tokens, we should have them at adjacent positions so that a phrase search for wikipedia org works. It would be nice to have the following tokenization and positioning: position 0: WORD(wikipedia), URL(wikipedia.org) position 1: WORD(org) Take the example of wikipedia.org/search?q=sushant. Here is the TSVECTOR: select to_tsvector('english', 'wikipedia.org/search?q=sushant'); to_tsvector '/search?q=sushant':3 'wikipedia.org':2 'wikipedia.org/search?q=sushant':1 And here are the tokens: select ts_debug('english', 'wikipedia.org/search?q=sushant'); ts_debug (url,URL,wikipedia.org/search?q=sushant,{simple},simple,{wikipedia.org/search?q=sushant}) (host,Host,wikipedia.org,{simple},simple,{wikipedia.org}) (url_path,"URL path",/search?q=sushant,{simple},simple,{/search?q=sushant}) The tokenization I would like to see is: position 0: WORD(wikipedia), URL(wikipedia.org/search?q=sushant) position 1: WORD(org) position 2: WORD(search), URL_PATH(search/?q=sushant) position 3: WORD(q), URL_QUERY(q=search) position 4: WORD(sushant) So what we need is to support multiple tokens at the same position, and I need help in understanding how to realize this. Currently the position assignment happens in make_tsvector by working over parsed lexemes. The lexeme is obtained by prsd_nexttoken. However, prsd_nexttoken only returns a single token. Would it be possible to store some tokens and return them together? Or can we put a flag on certain tokens that says the position should not be increased? -Sushant.
Re: [HACKERS] english parser in text search: support for multiple words in the same position
On 08/01/2010 08:04 PM, Sushant Sinha wrote:

1. We do not have separate tokens wikipedia and org
2. If we have the two tokens we should have them at adjacent position so that a phrase search for wikipedia org should work.

This would needlessly increase the number of tokens. Instead you'd better make it work like compound word support, having just wikipedia and org as tokens.

The current text parser already returns url and url_path. That already increases the number of unique tokens. I am only asking for adding the normal english words as well, so that if someone types only wikipedia he gets a match. Searching for wikipedia.org or "wikipedia org" should then result in the same search query with the two tokens: wikipedia and org. Earlier, people have expressed the need to index urls/emails, and the current text parser already does so. Reverting that would be a regression of functionality. Further, a ranking function can take advantage of a direct match of a token.

position 0: WORD(wikipedia), URL(wikipedia.org/search?q=sushant)

IMO the differentiation between WORDs and URLs is not something the text search engine should have to care a lot about. Let it just do the searching and make it do that well.

Postgres's english parser already emits urls as tokens. The only thing I am asking for is improving the tokenization and positioning.

What does a token wikipedia.org/search?q=sushant buy you in terms of text searching? Or even result highlighting? I wouldn't expect anybody to want to search for a full URL, do you?

There have been needs expressed in the past. And an exact token match can result in better ranking functions. For example, a tf-idf ranking will rank matches of such unique tokens significantly higher.

-Sushant.

Regards, Markus Wanner
Re: [HACKERS] english parser in text search: support for multiple words in the same position
On Mon, 2010-08-02 at 09:32 -0400, Robert Haas wrote: On Mon, Aug 2, 2010 at 9:12 AM, Sushant Sinha sushant...@gmail.com wrote:

The current text parser already returns url and url_path. That already increases the number of unique tokens. I am only asking for adding the normal english words as well so that if someone types only wikipedia he gets a match. [...] Postgres's english parser already emits urls as tokens. The only thing I am asking for is improving the tokenization and positioning.

Can you write a patch to implement your idea?

Yes, that's what I am planning to do. I just wanted to see if anyone can help me estimate whether this is doable in the current parser or whether I need to write a new one. If possible, some ideas on how to go about implementing it?

-Sushant.
Re: [HACKERS] english parser in text search: support for multiple words in the same position
I have attached a patch that emits parts of a host token, a url token, an email token and a file token. Further, it makes sure that a host/url/email/file token and the first part-token are at the same position in the tsvector. The two major changes are:

1. Tokenization changes: The patch exploits the special handlers in the text parser to reset the parser position to the start of a host/url/email/file token when it finds one. Special handlers were already used for extracting host and urlpath from a full url, so this is more of an extension of the same idea.

2. Position changes: We do not advance the position when we encounter a host/url/email/file token. As a result the first part of that token aligns with the token itself.

Attachments:
tokens_output.txt: sample queries and results with the patch
token_v1.patch: patch wrt cvs head

Currently, the patch outputs parts of the tokens as normal tokens like WORD, NUMWORD etc. Tom argued earlier that this will break backward compatibility and so they should be output as parts of the respective tokens. If there is agreement with what Tom says, then the current patch can be modified to output subtokens as parts. However, before I complicate the patch with that, I wanted to get feedback on any other major problems with the patch.

-Sushant.

On Mon, 2010-08-02 at 10:20 -0400, Tom Lane wrote: Sushant Sinha sushant...@gmail.com writes:

This would needlessly increase the number of tokens. Instead you'd better make it work like compound word support, having just wikipedia and org as tokens.

The current text parser already returns url and url_path. That already increases the number of unique tokens. I am only asking for adding the normal english words as well so that if someone types only wikipedia he gets a match.

The suggestion to make it work like compound words is still a good one, ie given wikipedia.org you'd get back

host        wikipedia.org
host-part   wikipedia
host-part   org

not just the host token as at present.
Then the user could decide whether he needed to index hostname components or not, by choosing whether to forward hostname-part tokens to a dictionary or just discard them. If you submit a patch that tries to force the issue by classifying hostname parts as plain words, it'll probably get rejected out of hand on backwards-compatibility grounds.

regards, tom lane

1. FILEPATH

testdb=# SELECT ts_debug('/stuff/index.html');
 (file,"File or path name",/stuff/index.html,{simple},simple,{/stuff/index.html})
 (blank,"Space symbols",/,{},,)
 (asciiword,"Word, all ASCII",stuff,{english_stem},english_stem,{stuff})
 (blank,"Space symbols",/,{},,)
 (asciiword,"Word, all ASCII",index,{english_stem},english_stem,{index})
 (blank,"Space symbols",.,{},,)
 (asciiword,"Word, all ASCII",html,{english_stem},english_stem,{html})

SELECT to_tsvector('english', '/stuff/index.html');
                     to_tsvector
-----------------------------------------------------
 '/stuff/index.html':0 'html':2 'index':1 'stuff':0
(1 row)

2. URL

testdb=# SELECT ts_debug('http://example.com/stuff/index.html');
 (protocol,"Protocol head",http://,{},,)
 (url,URL,example.com/stuff/index.html,{simple},simple,{example.com/stuff/index.html})
 (host,Host,example.com,{simple},simple,{example.com})
 (asciiword,"Word, all ASCII",example,{english_stem},english_stem,{exampl})
 (blank,"Space symbols",.,{},,)
 (asciiword,"Word, all ASCII",com,{english_stem},english_stem,{com})
 (url_path,"URL path",/stuff/index.html,{simple},simple,{/stuff/index.html})
 (blank,"Space symbols",/,{},,)
 (asciiword,"Word, all ASCII",stuff,{english_stem},english_stem,{stuff})
 (blank,"Space symbols",/,{},,)
 (asciiword,"Word, all ASCII",index,{english_stem},english_stem,{index})
 (blank,"Space symbols",.,{},,)
 (asciiword,"Word, all ASCII",html,{english_stem},english_stem,{html})
(13 rows)

testdb=# SELECT to_tsvector('english', 'http://example.com/stuff/index.html');
 '/stuff/index.html':2 'com':1 'exampl':0 'example.com':0 'example.com/stuff/index.html':0 'html':4 'index':3 'stuff':2

3.
EMAIL

testdb=# SELECT ts_debug('sush...@foo.bar');
 (email,"Email address",sush...@foo.bar,{simple},simple,{sush
Re: [HACKERS] english parser in text search: support for multiple words in the same position
Updating the patch with emitting parttoken and registering it with the snowball config.

-Sushant.

On Fri, 2010-09-03 at 09:44 -0400, Robert Haas wrote: On Wed, Sep 1, 2010 at 2:42 AM, Sushant Sinha sushant...@gmail.com wrote: I have attached a patch that emits parts of a host token, a url token, an email token and a file token. Further, it makes sure that a host/url/email/file token and the first part-token are at the same position in tsvector.

You should probably add this patch here: https://commitfest.postgresql.org/action/commitfest_view/open

Index: src/backend/snowball/snowball.sql.in
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/snowball/snowball.sql.in,v
retrieving revision 1.6
diff -u -r1.6 snowball.sql.in
--- src/backend/snowball/snowball.sql.in	27 Oct 2007 16:01:08 -0000	1.6
+++ src/backend/snowball/snowball.sql.in	4 Sep 2010 02:59:10 -0000
@@ -22,6 +22,6 @@
     WITH _ASCDICTNAME_;
 
 ALTER TEXT SEARCH CONFIGURATION _CFGNAME_ ADD MAPPING
-    FOR word, hword_part, hword
+    FOR word, hword_part, hword, parttoken
     WITH _NONASCDICTNAME_;
Index: src/backend/tsearch/ts_parse.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/tsearch/ts_parse.c,v
retrieving revision 1.17
diff -u -r1.17 ts_parse.c
--- src/backend/tsearch/ts_parse.c	26 Feb 2010 02:01:05 -0000	1.17
+++ src/backend/tsearch/ts_parse.c	4 Sep 2010 02:59:11 -0000
@@ -19,7 +19,7 @@
 #include "tsearch/ts_utils.h"
 
 #define IGNORE_LONGLEXEME	1
-
+#define COMPLEX_TOKEN(x) ( x == 4 || x == 5 || x == 6 || x == 18 || x == 17 || x == 18 || x == 19)
 /*
  * Lexize subsystem
  */
@@ -407,8 +407,6 @@
 		{
 			TSLexeme   *ptr = norms;
 
-			prs->pos++;			/* set pos */
-
 			while (ptr->lexeme)
 			{
 				if (prs->curwords == prs->lenwords)
@@ -429,6 +427,10 @@
 				prs->curwords++;
 			}
 			pfree(norms);
+
+			if (!COMPLEX_TOKEN(type))
+				prs->pos++;		/* set pos */
+
 		}
 	} while (type > 0);
Index: src/backend/tsearch/wparser_def.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/tsearch/wparser_def.c,v
retrieving revision 1.33
diff -u -r1.33 wparser_def.c
--- src/backend/tsearch/wparser_def.c	19 Aug 2010 05:57:34 -0000	1.33
+++ src/backend/tsearch/wparser_def.c	4 Sep 2010 02:59:12 -0000
@@ -23,7 +23,7 @@
 
 /* Define me to enable tracing of parser behavior */
-/* #define WPARSER_TRACE */
+//#define WPARSER_TRACE
 
 /* Output token categories */
@@ -51,8 +51,9 @@
 #define SIGNEDINT		21
 #define UNSIGNEDINT		22
 #define XMLENTITY		23
+#define PARTTOKEN		24
 
-#define LASTNUM			23
+#define LASTNUM			24
 
 static const char *const tok_alias[] = {
 	"",
@@ -78,7 +79,8 @@
 	"float",
 	"int",
 	"uint",
-	"entity"
+	"entity",
+	"parttoken"
 };
 
 static const char *const lex_descr[] = {
@@ -105,7 +107,8 @@
 	"Decimal notation",
 	"Signed integer",
 	"Unsigned integer",
-	"XML entity"
+	"XML entity",
+	"Part of file/url/host/email"
 };
@@ -249,7 +252,8 @@
 	TParserPosition *state;
 	bool		ignore;
 	bool		wanthost;
-
+	int			partstop;
+	TParserState afterpart;
 	/* silly char */
 	char		c;
@@ -617,8 +621,41 @@
 	}
 	return 1;
 }
+
+static int
+p_ispartbingo(TParser *prs)
+{
+	int			ret = 0;
+
+	if (prs->partstop > 0)
+	{
+		ret = 1;
+		if (prs->partstop <= prs->state->posbyte)
+		{
+			prs->state->state = prs->afterpart;
+			prs->partstop = 0;
+		}
+		else
+			prs->state->state = TPS_Base;
+	}
+	return ret;
+}
+
+static int
+p_ispart(TParser *prs)
+{
+	if (prs->partstop > 0)
+		return 1;
+	else
+		return 0;
+}
+
+static int
+p_ispartEOF(TParser *prs)
+{
+	if (p_ispart(prs) && p_isEOF(prs))
+		return 1;
+	else
+		return 0;
+}
 
 /* deliberately suppress unused-function complaints for the above */
 void		_make_compiler_happy(void);
 void
@@ -688,6 +725,21 @@
 }
 
 static void
+SpecialPart(TParser *prs)
+{
+	prs->partstop = prs->state->posbyte;
+	prs->state->posbyte -= prs->state->lenbytetoken;
+	prs->state->poschar -= prs->state->lenchartoken;
+	prs->afterpart = TPS_Base;
+}
+
+static void
+SpecialUrlPart(TParser *prs)
+{
+	SpecialPart(prs);
+	prs->afterpart = TPS_InURLPathStart;
+}
+
+static void
 SpecialVerVersion(TParser *prs)
 {
 	prs->state->posbyte -= prs->state->lenbytetoken;
@@ -1057,6 +1109,7 @@
 	{p_iseqC, '-', A_PUSH, TPS_InSignedIntFirst, 0, NULL},
 	{p_iseqC, '+', A_PUSH, TPS_InSignedIntFirst, 0, NULL},
 	{p_iseqC, '&', A_PUSH, TPS_InXMLEntityFirst, 0, NULL},
+	{p_ispart, 0, A_NEXT, TPS_InSpace, 0, NULL},
 	{p_iseqC, '~', A_PUSH, TPS_InFileTwiddle, 0, NULL},
 	{p_iseqC, '/', A_PUSH, TPS_InFileFirst, 0, NULL},
 	{p_iseqC, '.', A_PUSH, TPS_InPathFirstFirst, 0, NULL},
@@ -1065,9 +1118,11 @@
 
 static const TParserStateActionItem actionTPS_InNumWord[] = {
+	{p_ispartEOF, 0, A_BINGO, TPS_Null, PARTTOKEN, NULL},
 	{p_isEOF, 0, A_BINGO, TPS_Base, NUMWORD, NULL},
 	{p_isalnum, 0, A_NEXT, TPS_InNumWord, 0, NULL},
 	{p_isspecial, 0, A_NEXT, TPS_InNumWord, 0, NULL},
+	{p_ispartbingo, 0, A_BINGO
Re: [HACKERS] english parser in text search: support for multiple words in the same position
For the headline generation to work properly, email/file/url/host need to become skip tokens. Updating the patch with that change.

-Sushant.

On Sat, 2010-09-04 at 13:25 +0530, Sushant Sinha wrote: Updating the patch with emitting parttoken and registering it with snowball config.

-Sushant.

On Fri, 2010-09-03 at 09:44 -0400, Robert Haas wrote: On Wed, Sep 1, 2010 at 2:42 AM, Sushant Sinha sushant...@gmail.com wrote: I have attached a patch that emits parts of a host token, a url token, an email token and a file token. Further, it makes sure that a host/url/email/file token and the first part-token are at the same position in tsvector.

You should probably add this patch here: https://commitfest.postgresql.org/action/commitfest_view/open

Index: src/backend/snowball/snowball.sql.in
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/snowball/snowball.sql.in,v
retrieving revision 1.6
diff -u -r1.6 snowball.sql.in
--- src/backend/snowball/snowball.sql.in	27 Oct 2007 16:01:08 -0000	1.6
+++ src/backend/snowball/snowball.sql.in	7 Sep 2010 01:46:55 -0000
@@ -22,6 +22,6 @@
     WITH _ASCDICTNAME_;
 
 ALTER TEXT SEARCH CONFIGURATION _CFGNAME_ ADD MAPPING
-    FOR word, hword_part, hword
+    FOR word, hword_part, hword, parttoken
     WITH _NONASCDICTNAME_;
Index: src/backend/tsearch/ts_parse.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/tsearch/ts_parse.c,v
retrieving revision 1.17
diff -u -r1.17 ts_parse.c
--- src/backend/tsearch/ts_parse.c	26 Feb 2010 02:01:05 -0000	1.17
+++ src/backend/tsearch/ts_parse.c	7 Sep 2010 01:46:55 -0000
@@ -19,7 +19,7 @@
 #include "tsearch/ts_utils.h"
 
 #define IGNORE_LONGLEXEME	1
-
+#define COMPLEX_TOKEN(x) ( x == 4 || x == 5 || x == 6 || x == 18 || x == 17 || x == 18 || x == 19)
 /*
  * Lexize subsystem
  */
@@ -407,8 +407,6 @@
 		{
 			TSLexeme   *ptr = norms;
 
-			prs->pos++;			/* set pos */
-
 			while (ptr->lexeme)
 			{
 				if (prs->curwords == prs->lenwords)
@@ -429,6 +427,10 @@
 				prs->curwords++;
 			}
 			pfree(norms);
+
+			if (!COMPLEX_TOKEN(type))
+				prs->pos++;		/* set pos */
+
 		}
 	} while (type > 0);
Index: src/backend/tsearch/wparser_def.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/tsearch/wparser_def.c,v
retrieving revision 1.33
diff -u -r1.33 wparser_def.c
--- src/backend/tsearch/wparser_def.c	19 Aug 2010 05:57:34 -0000	1.33
+++ src/backend/tsearch/wparser_def.c	7 Sep 2010 01:46:56 -0000
@@ -23,7 +23,7 @@
 
 /* Define me to enable tracing of parser behavior */
-/* #define WPARSER_TRACE */
+//#define WPARSER_TRACE
 
 /* Output token categories */
@@ -51,8 +51,9 @@
 #define SIGNEDINT		21
 #define UNSIGNEDINT		22
 #define XMLENTITY		23
+#define PARTTOKEN		24
 
-#define LASTNUM			23
+#define LASTNUM			24
 
 static const char *const tok_alias[] = {
 	"",
@@ -78,7 +79,8 @@
 	"float",
 	"int",
 	"uint",
-	"entity"
+	"entity",
+	"parttoken"
 };
 
 static const char *const lex_descr[] = {
@@ -105,7 +107,8 @@
 	"Decimal notation",
 	"Signed integer",
 	"Unsigned integer",
-	"XML entity"
+	"XML entity",
+	"Part of file/url/host/email"
 };
@@ -249,7 +252,8 @@
 	TParserPosition *state;
 	bool		ignore;
 	bool		wanthost;
-
+	int			partstop;
+	TParserState afterpart;
 	/* silly char */
 	char		c;
@@ -617,8 +621,41 @@
 	}
 	return 1;
 }
+
+static int
+p_ispartbingo(TParser *prs)
+{
+	int			ret = 0;
+
+	if (prs->partstop > 0)
+	{
+		ret = 1;
+		if (prs->partstop <= prs->state->posbyte)
+		{
+			prs->state->state = prs->afterpart;
+			prs->partstop = 0;
+		}
+		else
+			prs->state->state = TPS_Base;
+	}
+	return ret;
+}
+
+static int
+p_ispart(TParser *prs)
+{
+	if (prs->partstop > 0)
+		return 1;
+	else
+		return 0;
+}
+
+static int
+p_ispartEOF(TParser *prs)
+{
+	if (p_ispart(prs) && p_isEOF(prs))
+		return 1;
+	else
+		return 0;
+}
 
 /* deliberately suppress unused-function complaints for the above */
 void		_make_compiler_happy(void);
 void
@@ -688,6 +725,21 @@
 }
 
 static void
+SpecialPart(TParser *prs)
+{
+	prs->partstop = prs->state->posbyte;
+	prs->state->posbyte -= prs->state->lenbytetoken;
+	prs->state->poschar -= prs->state->lenchartoken;
+	prs->afterpart = TPS_Base;
+}
+
+static void
+SpecialUrlPart(TParser *prs)
+{
+	SpecialPart(prs);
+	prs->afterpart = TPS_InURLPathStart;
+}
+
+static void
 SpecialVerVersion(TParser *prs)
 {
 	prs->state->posbyte -= prs->state->lenbytetoken;
@@ -1057,6 +1109,7 @@
 	{p_iseqC, '-', A_PUSH, TPS_InSignedIntFirst, 0, NULL},
 	{p_iseqC, '+', A_PUSH, TPS_InSignedIntFirst, 0, NULL},
 	{p_iseqC, '&', A_PUSH, TPS_InXMLEntityFirst, 0, NULL},
+	{p_ispart, 0, A_NEXT, TPS_InSpace, 0, NULL},
 	{p_iseqC, '~', A_PUSH, TPS_InFileTwiddle, 0, NULL},
 	{p_iseqC, '/', A_PUSH, TPS_InFileFirst, 0, NULL},
 	{p_iseqC, '.', A_PUSH, TPS_InPathFirstFirst, 0, NULL},
@@ -1065,9 +1118,11 @@
 
 static const TParserStateActionItem actionTPS_InNumWord[] = {
+	{p_ispartEOF, 0
Re: [HACKERS] text search patch status update?
The default headline generation function is complicated. It checks a lot of cases to determine the best headline to be displayed. So Heikki's examples just show that the headline generation function may not be very intuitive. However, his examples were not affected by the bug. Because of the bug, hlCover was not returning a cover when the query item was the first lexeme in the text, and so the headline generation function would return just MinWords rather than the actual headline as per its logic. After the patch you will see the difference in the example: http://archives.postgresql.org/pgsql-hackers/2008-07/msg00785.php

-Sushant.

On Wed, 2009-01-07 at 20:50 -0500, Bruce Momjian wrote: Uh, where are we on this? I see the same output in CVS HEAD as Heikki, and I assume he thought at least one of them was wrong. ;-)

Heikki Linnakangas wrote: Sushant Sinha wrote: Patch #2. I think this is a straightforward bug fix.

Yes, I think you're right. In hlCover(), *q is 0 when the only match is the first item in the text, and we shouldn't bail out with "return false" in that case. But there seems to be something else going on here as well:

postgres=# select ts_headline('1 2 3 4 5', '2'::tsquery, 'MinWords=2, MaxWords=3');
 ts_headline
--------------
 <b>2</b> 3 4
(1 row)

postgres=# select ts_headline('aaa1 aaa2 aaa3 aaa4 aaa5','aaa2'::tsquery, 'MinWords=2, MaxWords=3');
   ts_headline
------------------
 <b>aaa2</b> aaa3
(1 row)

In the first example, you get three words, and in the 2nd, just two. It must be because of the default ShortWord setting of 3.
Also, if only the last word matches, and it's a short word, you get the whole text:

postgres=# select ts_headline('1 2 3 4 5','5'::tsquery, 'MinWords=2, MaxWords=3');
   ts_headline
------------------
 1 2 3 4 <b>5</b>
(1 row)

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
[HACKERS] possible bug in cover density ranking?
I am running postgres 8.3.1. In tsrank.c I am looking at the cover density function used for ranking while doing text search:

float4 calc_rank_cd(float4 *arrdata, TSVector txt, TSQuery query, int method)

Here is the excerpt of code that I think may possibly have a bug when the document is big enough to exceed the 16383 position limit.

CODE
===
Cpos = ((double) (ext.end - ext.begin + 1)) / InvSum;

/*
 * if doc are big enough then ext.q may be equal to ext.p due to limit
 * of positional information. In this case we approximate number of
 * noise word as half cover's length
 */
nNoise = (ext.q - ext.p) - (ext.end - ext.begin);
if (nNoise < 0)
    nNoise = (ext.end - ext.begin) / 2;

Wdoc += Cpos / ((double) (1 + nNoise));
===

As per my understanding, ext.end - ext.begin + 1 is the number of query items in the cover and ext.q - ext.p gives the length of the cover. So consider a query with two query items. When we run out of position information, Cover returns ext.q = 16383 and ext.p = 16383, and the number of query items = ext.end - ext.begin + 1 = 2.

nNoise becomes -1 and then nNoise is initialized to (ext.end - ext.begin)/2 = 0.

Wdoc becomes Cpos = 2/InvSum = 2/(1/0.1 + 1/0.1) = 0.1

Is this what is desired? It seems to me that Wdoc is getting a high ranking even when we are not sure of the position information. The comment above says that "In this case we approximate number of noise word as half cover's length". But we do not know the cover's length in this case, as ext.p and ext.q are both unreliable. And ext.end - ext.begin is not the cover's length; it is the number of query items found in the cover.

Any clarification would be useful.

Thanks, -Sushant.
Re: [HACKERS] possible bug in cover density ranking?
On Thu, Jan 29, 2009 at 12:38 PM, Teodor Sigaev teo...@sigaev.ru wrote:

Is this what is desired? It seems to me that Wdoc is getting a high ranking even when we are not sure of the position information.

0.1 is not a very high rank, and we could not suggest any reasonable rank in this case. This document may be good, may be bad. rank_cd is not limited by 1.

For a cover of 2 query items, 0.1 is actually the maximum rank. This is only possible when both query items are adjacent to each other. 0.1 may not seem too high when we look at its absolute value. But the problem is that we are ranking a document for which no positional information is available higher than a document for which positional information is available with, let us suppose, a cover length of 3. I think we should rank the document with cover length 3 higher than the document for which we have no positional information (and for which we currently assume a cover length of 2).

I feel that if ext.p = ext.q for query items > 1, then we should not count that cover for ranking at all. Or, another option would be to significantly inflate nNoise in this scenario to, say, 100. Putting nNoise = (ext.end - ext.begin)/2 is way too low for covers that we have no idea about (it is 0 for query items = 2).

I am not assuming or suggesting that rank_cd is bounded by one. Of course its rank increases as more and more covers are added.

Thanks, Sushant.

The comment above says that "In this case we approximate number of noise word as half cover's length". But we do not know the cover's length in this case as ext.p and ext.q are both unreliable. And ext.end - ext.begin is not the cover's length. It is the number of query items found in the cover.

Yeah, but if there is no information then information is absent :), but I agree with you to change the comment.

-- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
Re: [HACKERS] Ellipses around result fragment of ts_headline
I think we currently do that. We add ellipses only when we encounter a new fragment. So there should not be ellipses if we are at the end of the document or if that is the first fragment (which includes the beginning of the document). Here is the code in generateHeadline, ts_parse.c, that adds the ellipses:

if (!infrag)
{
    /* start of a new fragment */
    infrag = 1;
    numfragments++;

    /* add a fragment delimitor if this is after the first one */
    if (numfragments > 1)
    {
        memcpy(ptr, prs->fragdelim, prs->fragdelimlen);
        ptr += prs->fragdelimlen;
    }
}

It is possible that there is a bug that needs to be fixed. Can you show me an example where you found that?

-Sushant.

On Sat, 2009-02-14 at 15:13 -0500, Asher Snyder wrote: It would be very useful if there were an option to have ts_headline append ellipses before or after a result fragment based on the position of the fragment in the source document. For instance, when running ts_headline(doc, query) it will correctly return a fragment with words highlighted; however, there's no easy way to determine whether this returned fragment is at the beginning or end of the original doc, and add the necessary ellipses. Searches such as postgresql.org ALWAYS add ellipses before or after the fragment regardless of whether or not ellipses are warranted. In my opinion always adding ellipses to the fragment is deceptive to the user; in many of my search result cases, the fragment is at the beginning of the doc, and it would confuse the user to always see ellipses. So you can see how the feature described above would be beneficial to the accuracy of the search result fragment.
Re: [HACKERS] Ellipses around result fragment of ts_headline
The documentation in 8.4dev has information on FragmentDelimiter: http://developer.postgresql.org/pgdocs/postgres/textsearch-controls.html

If you do not specify MaxFragments > 0, then the default headline generator kicks in. The default headline generator does not have any fragment delimiter, so it is correct that you will not see any delimiter. I think you are looking for the default headline generator to add ellipses as well, depending on where the fragment is. I do not know what other people's opinion on this is.

-Sushant.

On Sat, 2009-02-14 at 16:21 -0500, Asher Snyder wrote: Interesting, it could be that you already do it, but the documentation makes no reference to a fragment delimiter, so there's no way that I can see to add one. The documentation for ts_headline only lists StartSel, StopSel, MaxWords, MinWords, ShortWord, and HighlightAll; there appears to be no option for a fragment delimiter. In my case I do:

SELECT v1.id, v1.type_id, v1.title,
       ts_headline(v1.copy, query, 'MinWords = 17') as copy,
       ts_rank(v1.text_search, query) AS rank
FROM (SELECT b1.*,
             (setweight(to_tsvector(coalesce(b1.title,'')), 'A') ||
              setweight(to_tsvector(coalesce(b1.copy,'')), 'B')) as text_search
      FROM search.v_searchable_content b1) v1,
     plainto_tsquery($1) query
WHERE ($2 IS NULL OR (type_id = ANY($2))) AND query @@ v1.text_search
ORDER BY rank DESC, title

Now, this use of ts_headline correctly returns me highlighted fragmented search results, but there will be no fragment delimiter for the headline. Some suggestions were to change ts_headline(v1.copy, query, 'MinWords = 17') to '...' || ts_headline(v1.copy, query, 'MinWords = 17') || '...', but as you can clearly see this would always occur, and not be intelligent regarding the fragments.
I hope that you're correct and that it is implemented, and not documented.

-----Original Message----- From: Sushant Sinha [mailto:sushant...@gmail.com] Sent: Saturday, February 14, 2009 4:07 PM To: Asher Snyder Cc: pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Ellipses around result fragment of ts_headline

I think we currently do that. We add ellipses only when we encounter a new fragment. So there should not be ellipses if we are at the end of the document or if that is the first fragment (which includes the beginning of the document). Here is the code in generateHeadline, ts_parse.c, that adds the ellipses:

if (!infrag)
{
    /* start of a new fragment */
    infrag = 1;
    numfragments++;

    /* add a fragment delimitor if this is after the first one */
    if (numfragments > 1)
    {
        memcpy(ptr, prs->fragdelim, prs->fragdelimlen);
        ptr += prs->fragdelimlen;
    }
}

It is possible that there is a bug that needs to be fixed. Can you show me an example where you found that?

-Sushant.

On Sat, 2009-02-14 at 15:13 -0500, Asher Snyder wrote: It would be very useful if there were an option to have ts_headline append ellipses before or after a result fragment based on the position of the fragment in the source document. For instance, when running ts_headline(doc, query) it will correctly return a fragment with words highlighted; however, there's no easy way to determine whether this returned fragment is at the beginning or end of the original doc, and add the necessary ellipses. Searches such as postgresql.org ALWAYS add ellipses before or after the fragment regardless of whether or not ellipses are warranted. In my opinion always adding ellipses to the fragment is deceptive to the user; in many of my search result cases, the fragment is at the beginning of the doc, and it would confuse the user to always see ellipses. So you can see how the feature described above would be beneficial to the accuracy of the search result fragment.
Re: [HACKERS] Ellipses around result fragment of ts_headline
Sorry ... I thought you were running the development branch.

-Sushant.

On Sat, 2009-02-14 at 16:34 -0500, Tom Lane wrote: Sushant Sinha sushant...@gmail.com writes: I think we currently do that.

... since about four months ago.

2008-10-17 14:05  teodor

	* doc/src/sgml/textsearch.sgml, src/backend/tsearch/ts_parse.c,
	src/backend/tsearch/wparser_def.c, src/include/tsearch/ts_public.h,
	src/test/regress/expected/tsearch.out,
	src/test/regress/sql/tsearch.sql: Improve headline generation. Now
	headline can contain several fragments a-la Google. Sushant Sinha
	sushant...@gmail.com

regards, tom lane
[HACKERS] patch for space around the FragmentDelimiter
FragmentDelimiter is an argument for the ts_headline function that separates different headline fragments. The default delimiter is " ... ". Currently, if someone specifies the delimiter as an option to the function, no extra space is added around it. However, the output does not look good without space around the delimiter. Since the option parsing function removes any space around the given value, it is not possible to add the desired space. The attached patch adds space when a FragmentDelimiter is specified.

QUERY:

SELECT ts_headline('english', '
Day after day, day after day,
We stuck, nor breath nor motion,
As idle as a painted Ship
Upon a painted Ocean.
Water, water, every where
And all the boards did shrink;
Water, water, every where,
Nor any drop to drink.
S. T. Coleridge (1772-1834)
', to_tsquery('english', 'Coleridge & stuck'),
'MaxFragments=2,FragmentDelimiter=***');

OLD RESULT:

 after day, day after day, We <b>stuck</b>, nor breath nor motion, As idle as a painted Ship Upon a painted Ocean. Water, water, every where And all the boards did shrink; Water, water, every where***drop to drink. S. T. <b>Coleridge</b>
(1 row)

NEW RESULT after the patch:

 after day, day after day, We <b>stuck</b>, nor breath nor motion, As idle as a painted Ship Upon a painted Ocean. Water, water, every where And all the boards did shrink; Water, water, every where *** drop to drink. S. T. <b>Coleridge</b>
(1 row)

Index: src/backend/tsearch/wparser_def.c
===================================================================
RCS file: /home/sushant/devel/pgrep/pgsql/src/backend/tsearch/wparser_def.c,v
retrieving revision 1.20
diff -c -r1.20 wparser_def.c
*** src/backend/tsearch/wparser_def.c	15 Jan 2009 16:33:59 -0000	1.20
--- src/backend/tsearch/wparser_def.c	2 Mar 2009 06:00:02 -0000
***************
*** 2082,2087 ****
--- 2082,2088 ----
  	int			shortword = 3;
  	int			max_fragments = 0;
  	int			highlight = 0;
+ 	int			len;
  	ListCell   *l;
  
  	/* config */
***************
*** 2105,2111 ****
  		else if (pg_strcasecmp(defel->defname, "StopSel") == 0)
  			prs->stopsel = pstrdup(val);
  		else if (pg_strcasecmp(defel->defname, "FragmentDelimiter") == 0)
! 			prs->fragdelim = pstrdup(val);
  		else if (pg_strcasecmp(defel->defname, "HighlightAll") == 0)
  			highlight = (pg_strcasecmp(val, "1") == 0 ||
  						 pg_strcasecmp(val, "on") == 0 ||
--- 2106,2116 ----
  		else if (pg_strcasecmp(defel->defname, "StopSel") == 0)
  			prs->stopsel = pstrdup(val);
  		else if (pg_strcasecmp(defel->defname, "FragmentDelimiter") == 0)
! 		{
! 			len = strlen(val) + 2 + 1;	/* 2 for spaces and 1 for end of string */
! 			prs->fragdelim = palloc(len * sizeof(char));
! 			snprintf(prs->fragdelim, len, " %s ", val);
! 		}
  		else if (pg_strcasecmp(defel->defname, "HighlightAll") == 0)
  			highlight = (pg_strcasecmp(val, "1") == 0 ||
  						 pg_strcasecmp(val, "on") == 0 ||
Index: src/test/regress/expected/tsearch.out
===================================================================
RCS file: /home/sushant/devel/pgrep/pgsql/src/test/regress/expected/tsearch.out,v
retrieving revision 1.15
diff -c -r1.15 tsearch.out
*** src/test/regress/expected/tsearch.out	17 Oct 2008 18:05:19 -0000	1.15
--- src/test/regress/expected/tsearch.out	2 Mar 2009 02:02:38 -0000
***************
*** 624,630 ****
  body <b>Sea</b> view wow <u><b>foo</b> bar</u> <i>qq</i>
  <a href="http://www.google.com/foo.bar.html" target="_blank">YES &nbsp;</a>
! ff-bg <script> document.write(15); </script>
--- 624,630 ----
  body <b>Sea</b> view wow <u><b>foo</b> bar</u> <i>qq</i>
  <a href="http://www.google.com/foo.bar.html" target="_blank">YES &nbsp;</a>
! ff-bg <script> document.write(15); </script>
***************
*** 712,726 ****
  Nor any drop to drink.
  S. T. Coleridge (1772-1834)
  ', to_tsquery('english', 'Coleridge & stuck'),
  'MaxFragments=2,FragmentDelimiter=***');
!                          ts_headline
! --------------------------------------------------------------
  after day, day after day, We <b>stuck</b>, nor breath nor motion,
  As idle as a painted Ship Upon a painted Ocean.
  Water, water, every where And all the boards did shrink;
! Water, water, every where***drop to drink. S. T. <b>Coleridge</b>
 (1 row)
--- 712,726 ----
  Nor any drop to drink.
  S. T. Coleridge (1772-1834)
  ', to_tsquery('english', 'Coleridge & stuck'),
  'MaxFragments=2,FragmentDelimiter=***');
!                          ts_headline
! --------------------------------------------------------------
  after day, day after day, We <b>stuck</b>, nor breath nor motion,
  As idle as a painted Ship Upon a painted Ocean.
  Water, water, every where And all the boards did shrink;
! Water, water, every where *** drop to drink. S. T. <b>Coleridge</b>
 (1 row)
Re: [HACKERS] patch for space around the FragmentDelimiter
Yeah, you are right. I did not know that you could pass a space using double quotes. -Sushant. On Sun, 2009-03-01 at 20:49 -0500, Tom Lane wrote: Sushant Sinha sushant...@gmail.com writes: FragmentDelimiter is an option to the ts_headline function that separates different headline fragments. The default delimiter is " ... ". Currently, if someone specifies the delimiter as an option to the function, no extra space is added around the delimiter. However, it does not look good without space around the delimiter. Maybe not to you, for the particular delimiter you happen to be working with, but it doesn't follow that spaces are always appropriate. Since the option parsing function removes any space around the given value, it is not possible to add any desired space. The attached patch adds space when a FragmentDelimiter is specified. I think this is a pretty bad idea. Better would be to document how to get spaces into the delimiter, i.e., use double quotes: FragmentDelimiter=" ... ". Hmm, actually, it looks to me like the documentation already shows this, in the example of the default values. regards, tom lane
Re: [HACKERS] [GENERAL] Fragments in tsearch2 headline
Headline generation uses hlCover to find fragments of the text that contain *all* query items. When there is no such fragment, it does not return anything. What you are asking for will require either returning *maximally* matching covers or handling partial matches as a separate case. -Sushant. On Mon, 2009-04-13 at 20:57 -0400, Tom Lane wrote: Sushant Sinha sushant...@gmail.com writes: Sorry for the delay. Here is the patch with the FragmentDelimiter option. It requires an extra option in HeadlineParsedText and uses that option during generateHeadline. I did some editing of the documentation for this patch and noticed that the explanation of the fragment-based headline method says "If not all query words are found in the document, then a single fragment of the first MinWords words in the document will be displayed." (That's what it says now, that is, based on my editing and testing of the original.) This seems like a pretty dumb fallback approach --- if you have only a partial match, the headline generation suddenly becomes about as stupid as it could possibly be. I could understand doing the above if the text actually contains *none* of the query words, but surely if it contains some of them we should still select fragments centered on those words. regards, tom lane
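The cover notion being discussed can be sketched outside the C code: a cover is a window of token positions that contains every query term, and hlCover looks for such windows. Below is a minimal Python model of finding the smallest cover, assuming term positions have already been extracted from the tokenized document; it illustrates the idea only and is not the wparser_def.c algorithm.

```python
from collections import Counter

def min_cover(positions):
    """Find the smallest (begin, end) window of token positions that
    contains at least one occurrence of every query term.
    `positions` maps each query term to its sorted occurrence list.
    Returns None when some term never occurs (no cover exists),
    which is the partial-match case discussed in the thread."""
    occs = sorted((p, term) for term, plist in positions.items() for p in plist)
    need = len(positions)      # number of distinct terms required
    have = Counter()           # occurrences of each term inside the window
    covered = 0                # distinct terms currently inside the window
    best = None
    left = 0
    for right, (p, term) in enumerate(occs):
        have[term] += 1
        if have[term] == 1:
            covered += 1
        # shrink from the left while the window still covers all terms
        while covered == need:
            begin, end = occs[left][0], p
            if best is None or end - begin < best[1] - best[0]:
                best = (begin, end)
            lterm = occs[left][1]
            have[lterm] -= 1
            if have[lterm] == 0:
                covered -= 1
            left += 1
    return best
```

With positions for two query terms, the function returns the tightest window containing both, or None when one term is absent from the document.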
Re: [HACKERS] possible bug in cover density ranking?
I see this among the open items here: http://wiki.postgresql.org/wiki/PostgreSQL_8.4_Open_Items Any interest in fixing this? -Sushant. On Thu, 2009-01-29 at 13:54 -0500, Sushant Sinha wrote: On Thu, Jan 29, 2009 at 12:38 PM, Teodor Sigaev teo...@sigaev.ru wrote: Is this what is desired? It seems to me that Wdoc is getting a high ranking even when we are not sure of the position information. 0.1 is not a very high rank, and we could not suggest any reasonable rank in this case. This document may be good, may be bad. rank_cd is not limited by 1. For a cover of 2 query items, 0.1 is actually the maximum rank. This is only possible when both query items are adjacent to each other. 0.1 may not seem too high when we look at its absolute value. But the problem is that we are ranking a document for which we have no positional information higher than a document for which we may have positional information with, let us suppose, a cover length of 3. I think we should rank the document with cover length 3 higher than the document for which we have no positional information (and for which we assume a cover length of 2, as we do now). I feel that if ext.p = ext.q for query items > 1, then we should not count that cover for ranking at all. Or, another option would be to significantly inflate nNoise in this scenario to, say, 100. Putting nNoise = (ext.end - ext.begin)/2 is way too low for covers that we have no idea about (it is 0 for query items <= 2). I am not assuming or suggesting that rank_cd is bounded by one. Of course its rank increases as more and more covers are added. Thanks, Sushant. The comment above says that "In this case we approximate number of noise word as half cover's length." But we do not know the cover's length in this case, as ext.p and ext.q are both unreliable. And ext.end - ext.begin is not the cover's length; it is the number of query items found in the cover.
Yeah, but if there is no information, then information is absent :). But I agree with you to change the comment. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/
[HACKERS] dot to be considered as a word delimiter?
Currently it seems that the dot is not considered a word delimiter by the english parser.

lawdb=# select to_tsvector('english', 'Mr.J.Sai Deepak');
       to_tsvector
-------------------------
 'deepak':2 'mr.j.sai':1
(1 row)

So the word obtained is mr.j.sai rather than the three words mr, j, sai. It does it correctly if there is a space in between, as a space is definitely a word delimiter.

lawdb=# select to_tsvector('english', 'Mr. J. Sai Deepak');
           to_tsvector
---------------------------------
 'j':2 'mr':1 'sai':3 'deepak':4
(1 row)

I think that the dot should be considered a word delimiter because when a dot is not followed by a space, most of the time it is a typing error. Besides, there are not many valid English words that have a dot in the middle. -Sushant.
Re: [HACKERS] dot to be considered as a word delimiter?
Fair enough. I agree that there is a valid need for returning such tokens as a host. But I think there is definitely a need to break it down into individual words. This will help in cases when a document is missing a space in between the words. So what we can do is: return the entire compound word as Host and also break it down into individual words. I can put up a patch for this if you guys agree. Returning multiple tokens for the same word is a feature of the text search parser as explained in the documentation here: http://www.postgresql.org/docs/8.3/static/textsearch-parsers.html Thanks, Sushant. On Tue, Jun 2, 2009 at 8:47 AM, Kenneth Marshall k...@rice.edu wrote: On Mon, Jun 01, 2009 at 08:22:23PM -0500, Kevin Grittner wrote: Sushant Sinha sushant...@gmail.com wrote: I think that dot should be considered by as a word delimiter because when dot is not followed by a space, most of the time it is an error in typing. Beside they are not many valid english words that have dot in between. It's not treating it as an English word, but as a host name. select ts_debug('english', 'Mr.J.Sai Deepak'); ts_debug --- (host,Host,Mr.J.Sai,{simple},simple,{mr.j.sai}) (blank,Space symbols, ,{},,) (asciiword,Word, all ASCII,Deepak,{english_stem},english_stem,{deepak}) (3 rows) You could run it through a dictionary which would deal with host tokens differently. Just be aware of what you'll be doing to www.google.com if you run into it. I hope this helps. -Kevin In our uses for full text indexing, it is much more important to be able to find host name and URLs than to find mistyped names. My two cents. Cheers, Ken
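The proposal above — keep the whole dotted token typed as a host, and additionally emit its dot-separated parts — can be sketched in Python. This is a hypothetical splitter for illustration only; the real parser emits typed tokens such as `host` through its state machine, and the multi-token-per-word behaviour is the parser feature linked above.

```python
def emit_tokens(text):
    """Emit (type, token) pairs the way the thread proposes: keep a
    dotted token whole as a 'host'-like token, then also break it into
    its dot-separated parts so each word is searchable on its own."""
    out = []
    for word in text.split():
        # an interior dot (not just leading/trailing) marks a host-like token
        if '.' in word.strip('.'):
            out.append(('host', word))
            out.extend(('word', part) for part in word.split('.') if part)
        else:
            out.append(('word', word))
    return out
```

For the thread's example, 'Mr.J.Sai Deepak' would yield the host token 'Mr.J.Sai' plus the individual words Mr, J, Sai, and Deepak — useful when a document is simply missing a space between words.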
Re: [HACKERS] It's June 1; do you know where your release is?
On Tue, 2009-06-02 at 17:26 -0700, Josh Berkus wrote: * possible bug in cover density ranking? -- From Teodor's response, this is maybe a doc patch and not a code patch. Teodor? Oleg? I personally think that this is a bug, because we are assigning a very high rank when we are not sure about the positional information. This is not a show stopper, though. -Sushant.
Re: [HACKERS] TS: Limited cover density ranking
The rank adds 1/coversize for each cover, so bigger covers do not have much impact anyway. What is the need for the patch? -Sushant. On Fri, 2012-01-27 at 18:06 +0200, karave...@mail.bg wrote: Hello, I have developed a variation of the cover density ranking functions that counts only covers that are smaller than a specified limit. It is useful for finding combinations of terms that appear near one another. Here is an example of usage:

-- normal cover density ranking: not changed
luben=> select ts_rank_cd(to_tsvector('a b c d e g h i j k'), to_tsquery('a & d'));
 ts_rank_cd
------------
      0.033
(1 row)

-- limited to 2
luben=> select ts_rank_cd(2, to_tsvector('a b c d e g h i j k'), to_tsquery('a & d'));
 ts_rank_cd
------------
          0
(1 row)

luben=> select ts_rank_cd(2, to_tsvector('a b c d e g h i j k a d'), to_tsquery('a & d'));
 ts_rank_cd
------------
        0.1
(1 row)

-- limited to 3
luben=> select ts_rank_cd(3, to_tsvector('a b c d e g h i j k'), to_tsquery('a & d'));
 ts_rank_cd
------------
      0.033
(1 row)

luben=> select ts_rank_cd(3, to_tsvector('a b c d e g h i j k a d'), to_tsquery('a & d'));
 ts_rank_cd
------------
       0.13
(1 row)

Find attached a patch against the 9.1.2 sources. I preferred to make a patch, not a separate extension, because it is only a 1-statement change in the calc_rank_cd function. If I had to make an extension, a lot of code would be duplicated between backend/utils/adt/tsrank.c and the extension. I have some questions: 1. Is it interesting to develop it further (documentation, cleanup, etc.) for inclusion in one of the next versions? If this is the case, there are some further questions: - should I overload ts_rank_cd (as in the examples above and the patch) or should I define a new set of functions, for example ts_rank_lcd? - should I define these new SQL-level functions in core, or should I go only with this 2-line change in calc_rank_cd() and define the new functions as an extension? If we prefer the latter, could I overload core functions with functions defined in extensions?
- and finally, there is always the possibility to duplicate the code and make an independent extension. 2. If I run the patched version on a cluster that was initialized with an unpatched server, is there a way to register the new functions in the system catalog without reinitializing the cluster? Best regards luben -- Luben Karavelov
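The trade-off under discussion can be put in a toy model that reproduces the numbers quoted above (an adjacent pair scores 0.1, a pair three words apart scores about 0.033) and the proposed width limit. This is an illustration only: the exact cutoff convention in the real patch may differ, and the real calc_rank_cd also applies position weights and normalization.

```python
def ts_rank_cd_toy(covers, limit=None):
    """Toy cover-density rank for a two-term query.  Each cover
    (begin, end) contributes 0.1 / width, where width is the gap
    between the two matched terms, so wide covers contribute little.
    With `limit` set (modeling the patched behaviour), covers wider
    than the limit are skipped entirely."""
    rank = 0.0
    for begin, end in covers:
        width = end - begin              # distance between the two terms
        if limit is not None and width > limit:
            continue                     # patched behaviour: too wide, ignore
        rank += 0.1 / width
    return rank
```

For the text 'a b c d e g h i j k a d' and query 'a & d' there are covers (1,4) and (11,12): unlimited, both contribute (≈0.133); with limit 2 only the adjacent pair counts (0.1), matching the session output above.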
[HACKERS] bug in ts_rank_cd
There is a bug in ts_rank_cd. It does not give the correct rank when the query lexeme is the first one in the tsvector. Example:

select ts_rank_cd(to_tsvector('english', 'abc sdd'), plainto_tsquery('english', 'abc'));
 ts_rank_cd
------------
          0

select ts_rank_cd(to_tsvector('english', 'bcg abc sdd'), plainto_tsquery('english', 'abc'));
 ts_rank_cd
------------
        0.1

The problem is that the cover finding algorithm ignores the lexeme at the 0th position. I have attached a patch which fixes it. After the patch the result is fine.

select ts_rank_cd(to_tsvector('english', 'abc sdd'), plainto_tsquery('english', 'abc'));
 ts_rank_cd
------------
        0.1

--- postgresql-9.0.0/src/backend/utils/adt/tsrank.c	2010-01-02 22:27:55.000000000 +0530
+++ postgres-9.0.0-tsrankbugfix/src/backend/utils/adt/tsrank.c	2010-12-21 18:39:57.000000000 +0530
@@ -551,7 +551,7 @@
 	memset(qr->operandexist, 0, sizeof(bool) * qr->query->size);
 
 	ext->p = 0x7fffffff;
-	ext->q = 0;
+	ext->q = -1;
 	ptr = doc + ext->pos;
 
 	/* find upper bound of cover from current position, move up */
[HACKERS] bug in ts_rank_cd
MY PREV EMAIL HAD A PROBLEM. Please reply to this one ============ There is a bug in ts_rank_cd. It does not give the correct rank when the query lexeme is the first one in the tsvector. Example:

select ts_rank_cd(to_tsvector('english', 'abc sdd'), plainto_tsquery('english', 'abc'));
 ts_rank_cd
------------
          0

select ts_rank_cd(to_tsvector('english', 'bcg abc sdd'), plainto_tsquery('english', 'abc'));
 ts_rank_cd
------------
        0.1

The problem is that the cover finding algorithm ignores the lexeme at the 0th position. I have attached a patch which fixes it. After the patch the result is fine.

select ts_rank_cd(to_tsvector('english', 'abc sdd'), plainto_tsquery('english', 'abc'));
 ts_rank_cd
------------
        0.1

--- postgresql-9.0.0/src/backend/utils/adt/tsrank.c	2010-01-02 22:27:55.000000000 +0530
+++ postgres-9.0.0-tsrankbugfix/src/backend/utils/adt/tsrank.c	2010-12-21 18:39:57.000000000 +0530
@@ -551,7 +551,7 @@
 	memset(qr->operandexist, 0, sizeof(bool) * qr->query->size);
 
 	ext->p = 0x7fffffff;
-	ext->q = 0;
+	ext->q = -1;
 	ptr = doc + ext->pos;
 
 	/* find upper bound of cover from current position, move up */
Re: [HACKERS] bug in ts_rank_cd
Sorry for sounding a false alarm. I was not running vanilla Postgres, and that is why I was seeing that problem. I should have checked with the vanilla one. -Sushant On Tue, 2010-12-21 at 23:03 -0500, Tom Lane wrote: Sushant Sinha sushant...@gmail.com writes: There is a bug in ts_rank_cd. It does not correctly give rank when the query lexeme is the first one in the tsvector. Hmm ... I cannot reproduce the behavior you're complaining of. You say

select ts_rank_cd(to_tsvector('english', 'abc sdd'), plainto_tsquery('english', 'abc'));
 ts_rank_cd
------------
          0

but I get

regression=# select ts_rank_cd(to_tsvector('english', 'abc sdd'),
regression(# plainto_tsquery('english', 'abc'));
 ts_rank_cd
------------
        0.1
(1 row)

The problem is that the Cover finding algorithm ignores the lexeme at the 0th position. As far as I can tell, there is no 0th position --- tsvector counts positions from one. The only way to see pos == 0 in the input to Cover() is if the tsvector has been stripped of position information. ts_rank_cd is documented to return 0 in that situation. Your patch would have the effect of causing it to return some nonzero, but quite bogus, ranking. regards, tom lane
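Tom's point — tsvector positions start at one, so a position of 0 only shows up when positions were stripped — goes with the two-phase shape of the cover search: extend forward until every query term has been seen, then shrink backward to the latest possible begin. The following is a Python illustration of that shape under those assumptions, not the tsrank.c Cover() code.

```python
def next_cover(doc, terms, start=0):
    """doc: list of tokens (1-based tsvector-style position = index+1);
    terms: set of query terms.  Returns (p, q), the next cover at or
    after `start`, or None when some term does not occur."""
    seen = {}
    q = None
    # phase 1 (move up): earliest point where all terms have occurred
    for i in range(start, len(doc)):
        if doc[i] in terms:
            seen[doc[i]] = i
            if len(seen) == len(terms):
                q = i
                break
    if q is None:
        return None
    # phase 2 (move down): shrink the begin as far as possible
    needed = set(terms)
    p = q
    for i in range(q, start - 1, -1):
        if doc[i] in needed:
            needed.discard(doc[i])
            p = i
            if not needed:
                break
    return (p + 1, q + 1)   # report 1-based positions; 0 means "no position"
```

In this model the first token legitimately sits at position 1, which is why a single-lexeme match at the start of the document still yields a cover — the reported misbehaviour came from a non-vanilla build, not from position numbering.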
Re: [HACKERS] english parser in text search: support for multiple words in the same position
Just a reminder that this patch is discussing how to break urls, emails, etc. into their components. On Mon, Oct 4, 2010 at 3:54 AM, Tom Lane t...@sss.pgh.pa.us wrote: [ sorry for not responding on this sooner, it's been hectic the last couple weeks ] Sushant Sinha sushant...@gmail.com writes: I looked at this patch a bit. I'm fairly unhappy that it seems to be inventing a brand new mechanism to do something the ts parser can already do. Why didn't you code the url-part mechanism using the existing support for compound words? I am not familiar with the compound word implementation, and so I am not sure how to split a url with compound word support. I looked into the documentation for compound words, and it does not say much about how to identify the components of a token. IIRC, the way that works is associated with pushing a sub-state of the state machine in order to scan each compound-word part. I don't have the details in my head anymore, though I recall having traced through it in the past. Look at the state machine actions that are associated with producing the compound word tokens and sub-tokens. I did look around for compound word support in postgres. In particular, I read the documentation and code in tsearch/spell.c, which seems to implement the compound word support. So, in my understanding, the way it works is: 1. Specify a dictionary of words in which each word has applicable prefix/suffix flags 2. Specify a flag file that provides prefix/suffix operations on those flags 3. The flag z indicates that a word in the dictionary can participate in compound word splitting 4. When a token matches words specified in the dictionary (after applying affix/suffix operations), the matching words are emitted as sub-words of the token (i.e., compound word). If my above understanding is correct, then I think it will not be possible to implement url/email splitting using the compound word support.
The main reason is that the compound word support requires a PRE-DETERMINED dictionary of words. So to split a url/email we would need to provide a list of *all possible* host names and user names. I do not think that is a possibility. Please correct me if I have misunderstood something. -Sushant.
Re: [HACKERS] english parser in text search: support for multiple words in the same position
I do not know whether this mail got lost in between or no one noticed it! On Thu, 2010-12-23 at 11:05 +0530, Sushant Sinha wrote: Just a reminder that this patch is discussing how to break urls, emails, etc. into their components. On Mon, Oct 4, 2010 at 3:54 AM, Tom Lane t...@sss.pgh.pa.us wrote: [ sorry for not responding on this sooner, it's been hectic the last couple weeks ] Sushant Sinha sushant...@gmail.com writes: I looked at this patch a bit. I'm fairly unhappy that it seems to be inventing a brand new mechanism to do something the ts parser can already do. Why didn't you code the url-part mechanism using the existing support for compound words? I am not familiar with the compound word implementation, and so I am not sure how to split a url with compound word support. I looked into the documentation for compound words, and it does not say much about how to identify the components of a token. IIRC, the way that works is associated with pushing a sub-state of the state machine in order to scan each compound-word part. I don't have the details in my head anymore, though I recall having traced through it in the past. Look at the state machine actions that are associated with producing the compound word tokens and sub-tokens. I did look around for compound word support in postgres. In particular, I read the documentation and code in tsearch/spell.c, which seems to implement the compound word support. So, in my understanding, the way it works is: 1. Specify a dictionary of words in which each word has applicable prefix/suffix flags 2. Specify a flag file that provides prefix/suffix operations on those flags 3. The flag z indicates that a word in the dictionary can participate in compound word splitting 4.
When a token matches words specified in the dictionary (after applying affix/suffix operations), the matching words are emitted as sub-words of the token (i.e., compound word). If my above understanding is correct, then I think it will not be possible to implement url/email splitting using the compound word support. The main reason is that the compound word support requires a PRE-DETERMINED dictionary of words. So to split a url/email we would need to provide a list of *all possible* host names and user names. I do not think that is a possibility. Please correct me if I have misunderstood something. -Sushant.
Re: [HACKERS] text search: restricting the number of parsed words in headline generation
I will do the profiling and present the results. On Wed, 2012-08-15 at 12:45 -0400, Tom Lane wrote: Bruce Momjian br...@momjian.us writes: Is this a TODO? AFAIR nothing's been done about the speed issue, so yes. I didn't like the idea of creating a user-visible knob when the speed issue might be fixable with internal algorithm improvements, but we never followed up on this in either fashion. regards, tom lane --- On Tue, Aug 23, 2011 at 10:31:42PM -0400, Tom Lane wrote: Sushant Sinha sushant...@gmail.com writes: Doesn't this force the headline to be taken from the first N words of the document, independent of where the match was? That seems rather unworkable, or at least unhelpful. In headline generation function, we don't have any index or knowledge of where the match is. We discover the matches by first tokenizing and then comparing the matches with the query tokens. So it is hard to do anything better than first N words. After looking at the code in wparser_def.c a bit more, I wonder whether this patch is doing what you think it is. Did you do any profiling to confirm that tokenization is where the cost is? Because it looks to me like the match searching in hlCover() is at least O(N^2) in the number of tokens in the document, which means it's probably the dominant cost for any long document. I suspect that your patch helps not so much because it saves tokenization costs as because it bounds the amount of effort spent in hlCover(). I haven't tried to do anything about this, but I wonder whether it wouldn't be possible to eliminate the quadratic blowup by saving more state across the repeated calls to hlCover(). At the very least, it shouldn't be necessary to find the last query-token occurrence in the document from scratch on each and every call. Actually, this code seems probably flat-out wrong: won't every successful call of hlCover() on a given document return exactly the same q value (end position), namely the last token occurrence in the document? 
How is that helpful? regards, tom lane -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
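The saved-state idea Tom sketches — do not rescan the whole document on every hlCover call — can be modeled as a single left-to-right pass that yields successive covers as it goes, so the total work stays linear in the number of tokens instead of quadratic. This is an illustration of the approach under that assumption, not the wparser_def.c code.

```python
def covers(doc, terms):
    """Yield successive (begin, end) covers (1-based positions) in one
    pass.  The most recent occurrence of each query term is tracked
    incrementally, so each yielded cover reuses state from the previous
    one instead of rescanning the document from scratch."""
    last = {}                       # term -> index of its latest occurrence
    for i, tok in enumerate(doc):
        if tok in terms:
            last[tok] = i
            if len(last) == len(terms):
                begin = min(last.values())
                yield (begin + 1, i + 1)
                # drop the term at the begin so the next cover starts later
                del last[doc[begin]]
```

Compare this with calling a fresh cover search once per fragment: the repeated forward scans are what make the existing hlCover usage O(N^2) on long documents.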
[HACKERS] pg_trgm: unicode string not working
I am using pg_trgm for spelling correction as described in the documentation. But I see that it does not work for a unicode string. The database was initialized with utf8 encoding and the C locale. Here is the table:

\d words
     Table "public.words"
 Column |  Type   | Modifiers
--------+---------+-----------
 word   | text    |
 ndoc   | integer |
 nentry | integer |
Indexes:
    "words_idx" gin (word gin_trgm_ops)

Query:

select word from words where word % 'कतद';

I get an error:

ERROR:  GIN indexes do not support whole-index scans

Any idea what is wrong? -Sushant.
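For context, what the % operator computes is trigram similarity. Below is a sketch of pg_trgm's documented trigram extraction for simple ASCII words (two spaces of left padding, one of right, lowercased); multibyte and locale handling inside the extension differ, which is likely where the problem above originates — if no indexable trigrams can be extracted from the query string, the GIN index would need a whole-index scan, producing the reported error.

```python
def trigrams(word):
    """pg_trgm-style trigrams for an ASCII word: pad with two spaces on
    the left and one on the right, then take all 3-grams (a sketch of
    the documented behaviour, not the C implementation)."""
    padded = '  ' + word.lower() + ' '
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def similarity(a, b):
    """Jaccard-style ratio of shared to total trigrams, as pg_trgm's
    similarity() is documented to compute."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)
```

For example, trigrams('word') gives the same set that show_trgm('word') reports, and similarity('word', 'words') comes out near pg_trgm's 0.571.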
[HACKERS] PL/Python: No stack trace for an exception
I am using plpythonu on postgres 9.0.2. One of my python functions was throwing a TypeError exception. However, I only see the exception in the database and not the stack trace. It becomes difficult to debug in Python if the stack trace is absent.

logdb=# select get_words(forminput) from fi;
ERROR:  PL/Python: TypeError: an integer is required
CONTEXT:  PL/Python function "get_words"

And here is the error if I run that function on the same data in python:

Traceback (most recent call last):
  File "valid.py", line 215, in <module>
    parse_query(result['forminput'])
  File "valid.py", line 132, in parse_query
    dateobj = datestr_to_obj(columnHash[column])
  File "valid.py", line 37, in datestr_to_obj
    dateobj = datetime.date(words[2], words[1], words[0])
TypeError: an integer is required

Is this a known problem, or does this need addressing? Thanks, Sushant.
Re: [HACKERS] PL/Python: No stack trace for an exception
On Thu, 2011-07-21 at 15:31 +0200, Jan Urbański wrote: On 21/07/11 15:27, Sushant Sinha wrote: I am using plpythonu on postgres 9.0.2. One of my python functions was throwing a TypeError exception. However, I only see the exception in the database and not the stack trace. It becomes difficult to debug if the stack trace is absent in Python. logdb=# select get_words(forminput) from fi; ERROR: PL/Python: TypeError: an integer is required CONTEXT: PL/Python function get_words And here is the error if I run that function on the same data in python: [traceback] Is this a known problem or this needs addressing? Yes, traceback support in PL/Python has already been implemented and is a new feature that will be available in PostgreSQL 9.1. Cheers, Jan Thanks Jan! Just one more reason to try 9.1. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] text search: restricting the number of parsed words in headline generation
Given a document and a query, the goal of headline generation is to produce text excerpts in which the query appears. Currently, headline generation in postgres follows these steps: 1. Tokenize the document and obtain the lexemes 2. Decide on the lexemes that should be part of the headline 3. Generate the headline So the time taken by headline generation depends directly on the size of the document: the longer the document, the more time taken to tokenize it and the more lexemes to operate on. Most of the time is taken by the tokenization phase, and for very big documents headline generation is very expensive. Here is a simple patch that limits the number of words during the tokenization phase and puts an upper bound on headline generation. The headline function takes a parameter MaxParsedWords. If this parameter is negative or not supplied, then the entire document is tokenized and operated on (the current behavior). However, if the supplied MaxParsedWords is a positive number, then tokenization stops after MaxParsedWords words have been obtained. The rest of the headline generation happens on the tokens obtained up to that point. The current patch can be applied to 9.1rc1. It lacks changes to the documentation and test cases. I will add them if you folks agree on the functionality. -Sushant.
diff -ru postgresql-9.1rc1/src/backend/tsearch/ts_parse.c postgresql-9.1rc1-dev/src/backend/tsearch/ts_parse.c
--- postgresql-9.1rc1/src/backend/tsearch/ts_parse.c	2011-08-19 02:53:13.000000000 +0530
+++ postgresql-9.1rc1-dev/src/backend/tsearch/ts_parse.c	2011-08-23 21:27:10.000000000 +0530
@@ -525,10 +525,11 @@
 }
 
 void
-hlparsetext(Oid cfgId, HeadlineParsedText *prs, TSQuery query, char *buf, int buflen)
+hlparsetext(Oid cfgId, HeadlineParsedText *prs, TSQuery query, char *buf, int buflen, int max_parsed_words)
 {
 	int			type,
-				lenlemm;
+				lenlemm,
+				numparsed = 0;
 	char	   *lemm = NULL;
 	LexizeData	ldata;
 	TSLexeme   *norms;
@@ -580,8 +581,8 @@
 			else
 				addHLParsedLex(prs, query, lexs, NULL);
 		} while (norms);
-
-	} while (type > 0);
+		numparsed += 1;
+	} while (type > 0 && (max_parsed_words < 0 || numparsed < max_parsed_words));
 
 	FunctionCall1(&(prsobj->prsend), PointerGetDatum(prsdata));
 }
--- postgresql-9.1rc1/src/backend/tsearch/wparser.c	2011-08-19 02:53:13.000000000 +0530
+++ postgresql-9.1rc1-dev/src/backend/tsearch/wparser.c	2011-08-23 21:30:12.000000000 +0530
@@ -304,6 +304,8 @@
 	text	   *out;
 	TSConfigCacheEntry *cfg;
 	TSParserCacheEntry *prsobj;
+	ListCell   *l;
+	int			max_parsed_words = -1;
 
 	cfg = lookup_ts_config_cache(PG_GETARG_OID(0));
 	prsobj = lookup_ts_parser_cache(cfg->prsId);
@@ -317,13 +319,21 @@
 	prs.lenwords = 32;
 	prs.words = (HeadlineWordEntry *) palloc(sizeof(HeadlineWordEntry) * prs.lenwords);
 
-	hlparsetext(cfg->cfgId, &prs, query, VARDATA(in), VARSIZE(in) - VARHDRSZ);
 	if (opt)
 		prsoptions = deserialize_deflist(PointerGetDatum(opt));
 	else
 		prsoptions = NIL;
 
+	foreach(l, prsoptions)
+	{
+		DefElem    *defel = (DefElem *) lfirst(l);
+		char	   *val = defGetString(defel);
+
+		if (pg_strcasecmp(defel->defname, "MaxParsedWords") == 0)
+			max_parsed_words = pg_atoi(val, sizeof(int32), 0);
+	}
+
+	hlparsetext(cfg->cfgId, &prs, query, VARDATA(in), VARSIZE(in) - VARHDRSZ, max_parsed_words);
 	FunctionCall3(&(prsobj->prsheadline),
 				  PointerGetDatum(&prs),
 				  PointerGetDatum(prsoptions),
diff -ru postgresql-9.1rc1/src/include/tsearch/ts_utils.h postgresql-9.1rc1-dev/src/include/tsearch/ts_utils.h
--- postgresql-9.1rc1/src/include/tsearch/ts_utils.h	2011-08-19 02:53:13.000000000 +0530
+++ postgresql-9.1rc1-dev/src/include/tsearch/ts_utils.h	2011-08-23 21:04:14.000000000 +0530
@@ -98,7 +98,7 @@
  */
 extern void hlparsetext(Oid cfgId, HeadlineParsedText *prs, TSQuery query,
-			char *buf, int4 buflen);
+			char *buf, int4 buflen, int max_parsed_words);
 extern text *generateHeadline(HeadlineParsedText *prs);
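The behaviour the patch adds can be sketched in a few lines of Python: tokenization simply stops once the requested number of tokens has been produced. Whitespace splitting stands in for the real lexer here; MaxParsedWords is the option the patch introduces, and a non-positive value keeps the current unbounded behaviour.

```python
def headline_tokens(text, max_parsed_words=-1):
    """Tokenize `text`, stopping after max_parsed_words tokens when
    that parameter is positive (a model of the patched hlparsetext)."""
    tokens = []
    for word in text.split():
        if 0 < max_parsed_words <= len(tokens):
            break                     # bound reached: stop parsing early
        tokens.append(word)
    return tokens
```

Headline selection then operates only on this bounded prefix, which is exactly why the fallback quality question in the follow-up thread arises: a match beyond the bound can never appear in the headline.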
Re: [HACKERS] text search: restricting the number of parsed words in headline generation
Here is a simple patch that limits the number of words during the tokenization phase and puts an upper bound on the headline generation. Doesn't this force the headline to be taken from the first N words of the document, independent of where the match was? That seems rather unworkable, or at least unhelpful. regards, tom lane In the headline generation function, we don't have any index or knowledge of where the match is. We discover the matches by first tokenizing and then comparing the tokens with the query tokens. So it is hard to do anything better than the first N words. One option could be to start looking for a good match while tokenizing and then stop once we have found one. Currently the algorithms that decide a good match operate independently of the tokenization, and there are two of them, so integrating them would not be easy. The patch is very helpful if you believe the common-case assumption that most of the time a good match is near the top of the document. Typically a search application generates headlines for the top matches of a query, i.e., those in which the query terms appear frequently. So there should be at least one or two good text excerpt matches at the top of the document. -Sushant.
Re: [HACKERS] text search: restricting the number of parsed words in headline generation
Actually, this code seems probably flat-out wrong: won't every successful call of hlCover() on a given document return exactly the same q value (end position), namely the last token occurrence in the document? How is that helpful? regards, tom lane There is a line that saves the computation state from the previous call, and the search only starts from there: int pos = *p;
Re: [HACKERS] english parser in text search: support for multiple words in the same position
I looked at this patch a bit. I'm fairly unhappy that it seems to be inventing a brand new mechanism to do something the ts parser can already do. Why didn't you code the url-part mechanism using the existing support for compound words? I am not familiar with the compound word implementation, and so I am not sure how to split a url with compound word support. I looked into the documentation for compound words, and it does not say much about how to identify the components of a token. Does a compound word get split by matching against a list of words? If yes, then we will not be able to use that, as we do not know all the words that can appear in a url/host/email/file. I think another approach could be to use the dict_regex dictionary support. However, we would have to match the regex with something the parser is doing. The current patch does not invent any new mechanism. It uses the special handler mechanism already present in the parser. For example, when the current parser finds a URL, it runs a special handler called SpecialFURL, which resets the parser position to the start of the token to find the hostname. After finding the host it moves on to finding the path. So you first get the URL, then the host, and finally the path. Similarly, we are resetting the parser to the start of the token on finding a url in order to output url parts. Then, before entering the state that can lead to a url, we output the url part. The state machine modification is similar for other tokens like file/email/host. The changes made to parsetext() seem particularly scary: it's not clear at all that that's not breaking unrelated behaviors. In fact, the changes in the regression test results suggest strongly to me that it *is* breaking things. Why are there so many diffs in examples that include no URLs at all? I think some of the difference is coming from the fact that pos now starts with 0, whereas it used to be 1 earlier. That is easily fixable, though.
> An issue that's nearly as bad is the 100% lack of documentation, which
> makes the patch difficult to review because it's hard to tell what it
> intends to accomplish or whether it's met the intent. The patch is not
> committable without documentation anyway, but right now I'm not sure
> it's even usefully reviewable.

I did not provide any explanation because I could not find a place in the code to put the documentation (it was just a modification of the state machine). Should I do a separate write-up to explain the desired output and the changes made to achieve it?

> In line with the lack of documentation, I would say that the choice of
> the name parttoken for the new token type is not helpful. Part of
> what? And none of the other token type names include the word token,
> so that's not a good decision either. Possibly url_part would be a
> suitable name.

I can modify it to output url-part/host-part/email-part/file-part if there is agreement on the rest of the issues. Let me know if I should go ahead with this.

-Sushant.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
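The re-scan behavior described above (a special handler resetting the parser to the token's start so the parts come out as separate, overlapping tokens) can be sketched with a toy model. This is an illustrative Python simplification, not the actual C parser; the crude URL detection and the url/host/url_path token names are chosen to mirror the discussion:

```python
# Toy model of the parser's "special handler" re-scan: on recognizing a
# URL token, back up over the same span and emit its component tokens
# too. Illustrative only -- the real parser is a C state machine.
from urllib.parse import urlsplit

def tokenize(text):
    tokens = []
    for word in text.split():
        if "://" in word:                  # crude URL detection (assumption)
            tokens.append(("url", word))
            parts = urlsplit(word)         # "re-scan" the same span
            tokens.append(("host", parts.netloc))
            if parts.path:
                tokens.append(("url_path", parts.path))
        else:
            tokens.append(("asciiword", word))
    return tokens

print(tokenize("see http://example.com/a/b now"))
```

The patch under review applies the same trick one level further down, re-scanning each url/host/email/file token to also emit its individual words.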
Re: [HACKERS] Configuring Text Search parser?
Your changes are mostly fine: they will get you tokens with _ characters in them. However, it is not nice to mix your new token with an existing token like NUMWORD. Give your new type of token a new name, probably UnderscoreWord. Then, on seeing _, move to a state that can identify the new token; if you finally recognize that token, output it. To extract portions of the newly created token, you can write a special handler for it that resets the parser position to the start of the token to get its parts, and then modify the state machine to output the part-token before entering the state that can lead to the token identified earlier. Look at these changes to the text parser as well:
http://archives.postgresql.org/pgsql-hackers/2010-09/msg4.php

-Sushant.

On Mon, 2010-09-20 at 16:01 +0200, jes...@krogh.cc wrote:
> Hi. I'm trying to migrate an application off an existing full text
> search engine and onto PostgreSQL. One of my main (remaining) headaches
> is the fact that PostgreSQL treats _ as a separation character, whereas
> the existing behaviour is to not split. That means:
>
> testdb=# select ts_debug('database_tag_number_999');
>                                   ts_debug
> ------------------------------------------------------------------------------
>  (asciiword,"Word, all ASCII",database,{english_stem},english_stem,{databas})
>  (blank,"Space symbols",_,{},,)
>  (asciiword,"Word, all ASCII",tag,{english_stem},english_stem,{tag})
>  (blank,"Space symbols",_,{},,)
>  (asciiword,"Word, all ASCII",number,{english_stem},english_stem,{number})
>  (blank,"Space symbols",_,{},,)
>  (uint,"Unsigned integer",999,{simple},simple,{999})
> (7 rows)
>
> The incoming data by design contains a set of tags which include _ and
> are expected to be one lexeme. I've tried patching my way out of this
> using this patch.
$ diff -w -C 5 src/backend/tsearch/wparser_def.c.orig src/backend/tsearch/wparser_def.c
*** src/backend/tsearch/wparser_def.c.orig	2010-09-20 15:58:37.06460 +0200
--- src/backend/tsearch/wparser_def.c	2010-09-20 15:58:41.193335577 +0200
***************
*** 967,986 ****
--- 967,988 ----
  static const TParserStateActionItem actionTPS_InNumWord[] = {
  	{p_isEOF, 0, A_BINGO, TPS_Base, NUMWORD, NULL},
  	{p_isalnum, 0, A_NEXT, TPS_InNumWord, 0, NULL},
  	{p_isspecial, 0, A_NEXT, TPS_InNumWord, 0, NULL},
+ 	{p_iseqC, '_', A_NEXT, TPS_InNumWord, 0, NULL},
  	{p_iseqC, '@', A_PUSH, TPS_InEmail, 0, NULL},
  	{p_iseqC, '/', A_PUSH, TPS_InFileFirst, 0, NULL},
  	{p_iseqC, '.', A_PUSH, TPS_InFileNext, 0, NULL},
  	{p_iseqC, '-', A_PUSH, TPS_InHyphenNumWordFirst, 0, NULL},
  	{NULL, 0, A_BINGO, TPS_Base, NUMWORD, NULL}
  };

  static const TParserStateActionItem actionTPS_InAsciiWord[] = {
  	{p_isEOF, 0, A_BINGO, TPS_Base, ASCIIWORD, NULL},
  	{p_isasclet, 0, A_NEXT, TPS_Null, 0, NULL},
+ 	{p_iseqC, '_', A_NEXT, TPS_Null, 0, NULL},
  	{p_iseqC, '.', A_PUSH, TPS_InHostFirstDomain, 0, NULL},
  	{p_iseqC, '.', A_PUSH, TPS_InFileNext, 0, NULL},
  	{p_iseqC, '-', A_PUSH, TPS_InHostFirstAN, 0, NULL},
  	{p_iseqC, '-', A_PUSH, TPS_InHyphenAsciiWordFirst, 0, NULL},
  	{p_iseqC, '@', A_PUSH, TPS_InEmail, 0, NULL},
***************
*** 995,1004 ****
--- 997,1007 ----
  static const TParserStateActionItem actionTPS_InWord[] = {
  	{p_isEOF, 0, A_BINGO, TPS_Base, WORD_T, NULL},
  	{p_isalpha, 0, A_NEXT, TPS_Null, 0, NULL},
  	{p_isspecial, 0, A_NEXT, TPS_Null, 0, NULL},
+ 	{p_iseqC, '_', A_NEXT, TPS_Null, 0, NULL},
  	{p_isdigit, 0, A_NEXT, TPS_InNumWord, 0, NULL},
  	{p_iseqC, '-', A_PUSH, TPS_InHyphenWordFirst, 0, NULL},
  	{NULL, 0, A_BINGO, TPS_Base, WORD_T, NULL}
  };

This will obviously break other people's applications, so my question would be: if this should be made configurable, how should it be done? As a sidenote... Xapian doesn't split on _, Lucene does. Thanks.
--
Jesper

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
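The intended effect of the patch above can be sketched with a toy model. This is illustrative Python only, not the real C state machine: stock behavior treats `_` as a separator, the patched behavior accepts it inside a word:

```python
# Toy illustration of the two tokenizer behaviors discussed above:
# stock PostgreSQL splits on '_' like other punctuation, while the
# patched parser treats '_' as a word character, keeping a tag whole.
import re

def tokens_stock(text):
    # '_' acts as a separator (it falls outside the word character class)
    return re.findall(r"[A-Za-z0-9]+", text)

def tokens_patched(text):
    # '_' is accepted inside a word, so the tag stays one token
    return re.findall(r"[A-Za-z0-9_]+", text)

print(tokens_stock("database_tag_number_999"))    # ['database', 'tag', 'number', '999']
print(tokens_patched("database_tag_number_999"))  # ['database_tag_number_999']
```

The configurability question is exactly which of these two character classes the parser should use, per configuration rather than compiled in.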
Re: [HACKERS] english parser in text search: support for multiple words in the same position
Any updates on this?

On Tue, Sep 21, 2010 at 10:47 PM, Sushant Sinha sushant...@gmail.com wrote:

>> I looked at this patch a bit. I'm fairly unhappy that it seems to be
>> inventing a brand new mechanism to do something the ts parser can
>> already do. Why didn't you code the url-part mechanism using the
>> existing support for compound words?
>
> I am not familiar with compound word implementation and so I am not
> sure how to split a url with compound word support. I looked into the
> documentation for compound words and that does not say much about how
> to identify components of a token. Does a compound word split by
> matching with a list of words? If yes, then we will not be able to use
> that as we do not know all the words that can appear in a
> url/host/email/file. I think another approach can be to use the
> dict_regex dictionary support. However, we will have to match the
> regex with something that parser is doing.
>
> The current patch is not inventing any new mechanism. It uses the
> special handler mechanism already present in the parser. For example,
> when the current parser finds a URL it runs a special handler called
> SpecialFURL which resets the parser position to the start of token to
> find hostname. After finding the host it moves to finding the path. So
> you first get the URL and then the host and finally the path.
> Similarly, we are resetting the parser to the start of the token on
> finding a url to output url parts. Then before entering the state that
> can lead to a url we output the url part. The state machine
> modification is similar for other tokens like file/email/host.
>
>> The changes made to parsetext() seem particularly scary: it's not
>> clear at all that that's not breaking unrelated behaviors. In fact,
>> the changes in the regression test results suggest strongly to me
>> that it *is* breaking things. Why are there so many diffs in examples
>> that include no URLs at all?
>
> I think some of the difference is coming from the fact that now pos
> starts with 0 and it used to be 1 earlier. That is easily fixable
> though.
>
>> An issue that's nearly as bad is the 100% lack of documentation,
>> which makes the patch difficult to review because it's hard to tell
>> what it intends to accomplish or whether it's met the intent. The
>> patch is not committable without documentation anyway, but right now
>> I'm not sure it's even usefully reviewable.
>
> I did not provide any explanation as I could not find any place in the
> code to provide the documentation (that was just a modification of
> state machine). Should I do a separate write-up to explain the desired
> output and the changes to achieve it?
>
>> In line with the lack of documentation, I would say that the choice
>> of the name parttoken for the new token type is not helpful. Part of
>> what? And none of the other token type names include the word token,
>> so that's not a good decision either. Possibly url_part would be a
>> suitable name.
>
> I can modify it to output url-part/host-part/email-part/file-part if
> there is an agreement over the rest of the issues. So let me know if I
> should go ahead with this.
>
> -Sushant.
Re: [HACKERS] Re: [GENERAL] Text search parser's treatment of URLs and emails
On Tue, 2010-10-12 at 19:31 -0400, Tom Lane wrote:
> This seems much of a piece with the existing proposal to allow
> individual words of a URL to be reported separately:
> https://commitfest.postgresql.org/action/patch_view?id=378
>
> As I said in that thread, this could be done in a backwards-compatible
> way using the tsearch parser's existing ability to report multiple
> overlapping tokens out of the same piece of text. But I'd like to see
> one unified proposal and patch for this and Sushant's patch, not
> independent hacks changing the behavior in the same area.
>
> regards, tom lane

What Tom has suggested will require me to look into a different piece of code, so it will take some time before I can update the patch.

-Sushant.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] planner row-estimates for tsvector seems horribly wrong
I am using a GIN index on a tsvector and doing basic search. I see that the planner's row estimate is horribly wrong: it returns a row estimate of 4843 for all queries, whether they match zero rows, a medium number of rows (88,000), or a large number of rows (726,000). The table has roughly a million docs. I see a similar problem reported here, but thought it was fixed in 9.0, which I am running:
http://archives.postgresql.org/pgsql-hackers/2010-05/msg01389.php

Here is the version info and the detailed planner output for all three queries:

select version();
                                                version
--------------------------------------------------------------------------------------------------------
 PostgreSQL 9.0.0 on x86_64-unknown-linux-gnu, compiled by GCC gcc (Gentoo 4.3.4 p1.1, pie-10.1.5) 4.3.4, 64-bit

Case I: FOR A NON-MATCHING WORD
===============================

explain analyze select count(*) from docmeta, plainto_tsquery('english', 'dyfdfdf') as qdoc where docvector @@ qdoc;

 Aggregate  (cost=20322.17..20322.18 rows=1 width=0) (actual time=0.058..0.058 rows=1 loops=1)
   ->  Nested Loop  (cost=5300.28..20310.06 rows=4843 width=0) (actual time=0.055..0.055 rows=0 loops=1)
         ->  Function Scan on qdoc  (cost=0.00..0.01 rows=1 width=32) (actual time=0.005..0.005 rows=1 loops=1)
         ->  Bitmap Heap Scan on docmeta  (cost=5300.28..20249.51 rows=4843 width=270) (actual time=0.046..0.046 rows=0 loops=1)
               Recheck Cond: (docmeta.docvector @@ qdoc.qdoc)
               ->  Bitmap Index Scan on doc_index  (cost=0.00..5299.07 rows=4843 width=0) (actual time=0.044..0.044 rows=0 loops=1)
                     Index Cond: (docmeta.docvector @@ qdoc.qdoc)
 Total runtime: 0.092 ms

Case II: FOR A MEDIUM-MATCHING WORD
===================================

explain analyze select count(*) from docmeta, plainto_tsquery('english', 'quit') as qdoc where docvector @@ qdoc;

 Aggregate  (cost=20322.17..20322.18 rows=1 width=0) (actual time=1222.856..1222.857 rows=1 loops=1)
   ->  Nested Loop  (cost=5300.28..20310.06 rows=4843 width=0) (actual time=639.275..1212.460 rows=88545 loops=1)
         ->  Function Scan on qdoc  (cost=0.00..0.01 rows=1 width=32) (actual time=0.006..0.007 rows=1 loops=1)
         ->  Bitmap Heap Scan on docmeta  (cost=5300.28..20249.51 rows=4843 width=270) (actual time=639.264..1196.542 rows=88545 loops=1)
               Recheck Cond: (docmeta.docvector @@ qdoc.qdoc)
               ->  Bitmap Index Scan on doc_index  (cost=0.00..5299.07 rows=4843 width=0) (actual time=621.877..621.877 rows=88545 loops=1)
                     Index Cond: (docmeta.docvector @@ qdoc.qdoc)
 Total runtime: 1222.907 ms

Case III: FOR A HIGH-MATCHING WORD
==================================

explain analyze select count(*) from docmeta, plainto_tsquery('english', 'j') as qdoc where docvector @@ qdoc;

 Aggregate  (cost=20322.17..20322.18 rows=1 width=0) (actual time=742.857..742.858 rows=1 loops=1)
   ->  Nested Loop  (cost=5300.28..20310.06 rows=4843 width=0) (actual time=126.804..660.895 rows=726985 loops=1)
         ->  Function Scan on qdoc  (cost=0.00..0.01 rows=1 width=32) (actual time=0.004..0.006 rows=1 loops=1)
         ->  Bitmap Heap Scan on docmeta  (cost=5300.28..20249.51 rows=4843 width=270) (actual time=126.795..530.422 rows=726985 loops=1)
               Recheck Cond: (docmeta.docvector @@ qdoc.qdoc)
               ->  Bitmap Index Scan on doc_index  (cost=0.00..5299.07 rows=4843 width=0) (actual time=113.742..113.742 rows=726985 loops=1)
                     Index Cond: (docmeta.docvector @@ qdoc.qdoc)
 Total runtime: 742.906 ms

Thanks,
Sushant.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] planner row-estimates for tsvector seems horribly wrong
Thanks a ton Jan! That works correctly. But many tsearch tutorials put the tsquery in the FROM clause, and that can cause a bad plan. Isn't it possible to return the correct number for a join with the query as well?

-Sushant.

On Sun, 2010-10-24 at 15:07 +0200, Jan Urbański wrote:
> On 24/10/10 14:44, Sushant Sinha wrote:
>> I am using gin index on a tsvector and doing basic search. I see the
>> row-estimate of the planner to be horribly wrong. It is returning
>> row-estimate as 4843 for all queries whether it matches zero rows, a
>> medium number of rows (88,000) or a large number of rows (726,000).
>> The table has roughly a million docs.
>>
>> explain analyze select count(*) from docmeta, plainto_tsquery('english', 'dyfdfdf') as qdoc where docvector @@ qdoc;
>
> OK, forget my previous message. The problem is that you are doing a
> join using @@ as the operator for the join condition, so the planner
> uses the operator's join selectivity estimate. For @@ the
> tsmatchjoinsel function simply returns 0.005. Try doing:
>
> explain analyze select count(*) from docmeta where docvector @@ plainto_tsquery('english', 'dyfdfdf');
>
> It should help.
>
> Cheers,
> Jan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
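Jan's explanation accounts for the constant 4843 exactly. A join row estimate is (outer rows) x (inner rows) x (join selectivity); with one tsquery row and tsmatchjoinsel's constant 0.005, the 4843 estimate implies the planner thinks the table has about 4843 / 0.005 = 968,600 rows, consistent with "roughly a million docs". A quick check with the numbers from the plans (the reltuples figure is inferred, not taken from pg_class):

```python
# Join row estimate = outer_rows * inner_rows * join_selectivity.
# tsmatchjoinsel returns a constant, so the estimate is the same for
# every search word, matching the identical rows=4843 in all three plans.
DEFAULT_TS_MATCH_JOIN_SEL = 0.005           # constant returned by tsmatchjoinsel

# Work backwards from the plan's estimate to the planner's table size:
estimated_reltuples = 4843 / DEFAULT_TS_MATCH_JOIN_SEL
print(estimated_reltuples)                  # 968600.0 -- "roughly a million docs"

# And forwards again: one tsquery row joined against the table:
print(round(1 * estimated_reltuples * DEFAULT_TS_MATCH_JOIN_SEL))  # 4843
```

Rewriting the query so @@ is a restriction clause (as Jan suggests) lets the planner use the per-value restriction selectivity estimator instead of this one-size-fits-all join constant.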
[HACKERS] lexemes in prefix search going through dictionary modifications
I am currently using the prefix search feature in text search. I find that the prefix characters are treated the same as a normal lexeme and passed through the stemming and stopword dictionaries. This seems like a bug to me.

db=# select to_tsquery('english', 's:*');
NOTICE:  text-search query contains only stop words or doesn't contain lexemes, ignored
 to_tsquery
------------

(1 row)

db=# select to_tsquery('simple', 's:*');
 to_tsquery
------------
 's':*
(1 row)

I also think that this is a mistake. It should only be highlighting the "s":

db=# select ts_headline('sushant', to_tsquery('simple', 's:*'));
  ts_headline
----------------
 <b>sushant</b>

Thanks,
Sushant.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] lexemes in prefix search going through dictionary modifications
On Tue, 2011-10-25 at 18:05 +0200, Florian Pflug wrote:
> On Oct25, 2011, at 17:26 , Sushant Sinha wrote:
>> I am currently using the prefix search feature in text search. I find
>> that the prefix characters are treated the same as a normal lexeme
>> and passed through stemming and stopword dictionaries. This seems
>> like a bug to me.
>
> Hm, I don't think so. If they don't pass through stopword
> dictionaries, then queries containing stopwords will fail to find any
> rows - which is probably not what one would expect.

I think what you are calling a feature is really a bug. I am fairly sure that when someone says to_tsquery('english', 's:*'), one is looking for an entry that has a *non-stopword* word starting with 's' - especially so in a text search configuration that eliminates stop words. Does it even make sense to stem, abbreviate, or apply synonyms to a few letters? It will be completely unpredictable.

-Sushant.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] lexemes in prefix search going through dictionary modifications
On Tue, 2011-10-25 at 19:27 +0200, Florian Pflug wrote:
> Assume, for example, that the postgres mailing list archive search
> used tsearch (which I think it does, but I'm not sure). It'd then
> probably make sense to add postgres to the list of stopwords, because
> it's bound to appear in nearly every mail. But wouldn't you want
> searches which include 'postgres*' to turn up empty?

Quite certainly not. That improves recall for the postgres:* query and certainly doesn't help other queries like post:*. But more importantly, it affects precision for all queries like a:*, an:*, and:*, s:*, t:*, the:*, etc. (When that is the only search term, it also affects recall, as no row matches an empty tsquery.) Since stopwords are short, this makes prefix search on a few characters meaningless. And I would argue that is exactly when prefix search matters most - when you only know a few characters.

-Sushant.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] a tsearch issue
On Fri, 2011-11-04 at 11:22 +0100, Pavel Stehule wrote:
> Hello
>
> I found a interesting issue when I checked a tsearch prefix searching.
> We use a ispell based dictionary:
>
> CREATE TEXT SEARCH DICTIONARY cspell (template=ispell, dictfile = czech, afffile=czech, stopwords=czech);
> CREATE TEXT SEARCH CONFIGURATION cs (copy=english);
> ALTER TEXT SEARCH CONFIGURATION cs ALTER MAPPING FOR word, asciiword WITH cspell, simple;
>
> Then I created a table:
>
> postgres=# create table n(a varchar);
> CREATE TABLE
> postgres=# insert into n values('Stěhule'),('Chromečka');
> INSERT 0 2
> postgres=# select * from n;
>      a
> ───────────
>  Stěhule
>  Chromečka
> (2 rows)
>
> and I tested a prefix searching. I found a following issue:
>
> postgres=# select * from n where to_tsvector('cs', a) @@ to_tsquery('cs','Stě:*');
>  a
> ───
> (0 rows)

Most likely you are hit by this problem:
http://archives.postgresql.org/pgsql-hackers/2011-10/msg01347.php
'Stě' may be a stopword in czech.

> I expected one row. The problem is in the transformation of the word 'Stě':
>
> postgres=# select * from ts_debug('cs','Stě:*');
> ─[ RECORD 1 ]┬──────────────────
> alias        │ word
> description  │ Word, all letters
> token        │ Stě
> dictionaries │ {cspell,simple}
> dictionary   │ cspell
> lexemes      │ {sto}
> ─[ RECORD 2 ]┼──────────────────
> alias        │ blank
> description  │ Space symbols
> token        │ :*
> dictionaries │ {}
> dictionary   │ [null]
> lexemes      │ [null]

':*' is only specific to to_tsquery; ts_debug just invokes the parser. So this is not correct.

-Sushant.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
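The failure mode can be sketched as a toy model. This is illustrative Python only: the 'Stě' → 'sto' mapping is taken from the ts_debug output above, while the stored lexeme 'stěhul' is an assumption about what the index holds for 'Stěhule':

```python
# Toy model: the prefix pattern is run through the dictionary before
# matching, so 'Stě:*' effectively becomes 'sto:*' and no longer
# prefix-matches the lexeme stored for 'Stěhule'. Mappings illustrative.
lexicon = {"stě": "sto"}                 # what cspell did to the prefix (from ts_debug)

def to_lexeme(word):
    w = word.lower()
    return lexicon.get(w, w)             # dictionary pass applied even to prefixes

def prefix_match(prefix, lexeme):
    return lexeme.startswith(prefix)

stored = "stěhul"                        # assumed indexed lexeme for 'Stěhule'

print(prefix_match(to_lexeme("Stě"), stored))  # False: 'sto' vs 'stěhul'
print(prefix_match("stě", stored))             # True: the raw prefix would match
```

This is the same dictionary-vs-prefix interaction discussed in the thread linked above: once the dictionary rewrites (or drops) the prefix, the `:*` match is performed against the wrong string.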
Re: [HACKERS] lexemes in prefix search going through dictionary modifications
I think there is a need to provide a prefix search that bypasses dictionaries. If you folks think there is some credibility to such a need, then I can think about implementing it. How about an operator like :# that does this? The :* operator will continue to mean the same as it does currently.

-Sushant.

On Tue, 2011-10-25 at 23:45 +0530, Sushant Sinha wrote:
> On Tue, 2011-10-25 at 19:27 +0200, Florian Pflug wrote:
>> Assume, for example, that the postgres mailing list archive search
>> used tsearch (which I think it does, but I'm not sure). It'd then
>> probably make sense to add postgres to the list of stopwords, because
>> it's bound to appear in nearly every mail. But wouldn't you want
>> searches which include 'postgres*' to turn up empty?
>
> Quite certainly not. That improves recall for postgres:* query and
> certainly doesn't help other queries like post:*. But more importantly
> it affects precision for all queries like a:*, an:*, and:*, s:*, t:*,
> the:*, etc. (When that is the only search it also affects recall as no
> row matches an empty tsquery.) Since stopwords are smaller, it means
> prefix search for a few characters is meaningless. And I would argue
> that is when the prefix search is more important -- only when you know
> a few characters.
>
> -Sushant

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Postgres 9.1: Adding rows to table causing too much latency in other queries
I recently upgraded my postgres server from 9.0 to 9.1.2 and I am seeing a peculiar problem. I have a program that periodically adds rows to a table using INSERT. Typically the number of rows added is just 1-2 thousand, when the table already has 500K rows. Whenever the program is adding rows, the performance of the search query on the same table becomes very bad. The query uses the GIN index and the tsearch ranking function ts_rank_cd. This never happened earlier with postgres 9.0.

Is there a known issue with Postgres 9.1? Or how should I report this problem?

-Sushant.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Postgres 9.1: Adding rows to table causing too much latency in other queries
On Mon, 2011-12-19 at 19:08 +0200, Marti Raudsepp wrote:
> Another thought -- have you read about the GIN fast updates feature?
> This existed in 9.0 too. Instead of updating the index directly, GIN
> appends all changes to a sequential list, which needs to be scanned in
> whole for read queries. The periodic autovacuum process has to merge
> these values back into the index.
>
> Maybe the solution is to tune autovacuum to run more often on the table.
>
> http://www.postgresql.org/docs/9.1/static/gin-implementation.html
>
> Regards,
> Marti

Probably this is the problem. Is running VACUUM ANALYZE under psql the same as autovacuum?

-Sushant.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
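Marti's description of the fast-update mechanism can be sketched as a toy model. This is illustrative Python only; the real structure is described in the GIN implementation docs linked above:

```python
# Toy model of GIN "fast update": inserts append to a pending list, so
# they are cheap, but every search must linearly scan the whole pending
# list in addition to probing the index. Vacuum merges the pending list
# back into the index, restoring fast lookups. Illustrative only.
from collections import defaultdict

class ToyGin:
    def __init__(self):
        self.index = defaultdict(set)    # lexeme -> set of row ids
        self.pending = []                # (rowid, lexemes) awaiting merge

    def insert(self, rowid, lexemes):
        self.pending.append((rowid, lexemes))    # fast: no index update

    def search(self, lexeme):
        hits = set(self.index[lexeme])
        for rowid, lexemes in self.pending:      # full pending-list scan
            if lexeme in lexemes:
                hits.add(rowid)
        return hits

    def vacuum(self):
        for rowid, lexemes in self.pending:      # merge into the index
            for lx in lexemes:
                self.index[lx].add(rowid)
        self.pending.clear()

g = ToyGin()
g.insert(1, {"postgres", "gin"})
g.insert(2, {"gin", "index"})
print(g.search("gin"))           # both rows found via the pending-list scan
g.vacuum()
print(g.pending, g.search("gin"))  # pending empty; now served from the index
```

This models why searches degrade while a batch of inserts sits unmerged, and why running vacuum (or letting autovacuum run more often) restores performance.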
Re: [HACKERS] Postgres 9.1: Adding rows to table causing too much latency in other queries
On Mon, 2011-12-19 at 12:41 -0300, Euler Taveira de Oliveira wrote:
> On 19-12-2011 12:30, Sushant Sinha wrote:
>> I recently upgraded my postgres server from 9.0 to 9.1.2 and I am
>> finding a peculiar problem. I have a program that periodically adds
>> rows to this table using INSERT. Typically the number of rows is just
>> 1-2 thousand when the table already has 500K rows. Whenever the
>> program is adding rows, the performance of the search query on the
>> same table is very bad. The query uses the gin index and the tsearch
>> ranking function ts_rank_cd.
>
> How bad is bad? It seems you are suffering from don't-fit-on-cache
> problem, no?

The machine has 32GB of memory and the entire database is just 22GB. Even vmstat 1 does not show any disk activity. I have not been able to isolate performance numbers, since I have only observed this on the production box, where the number of requests keeps increasing as the box gets loaded. But a query that normally takes 1 sec is taking more than 10 secs (I am not sure whether it got the same number of CPU cycles). Is there a way to find that out?

-Sushant.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tsearch Parser Hacking
I agree that it would be a good idea to rewrite the entire thing. In the meantime, however, I sent a proposal earlier:
http://archives.postgresql.org/pgsql-hackers/2010-08/msg00019.php

And a patch later:
http://archives.postgresql.org/pgsql-hackers/2010-09/msg00476.php

Tom asked me to look into compound word support, but I found it not usable. Here was my response:
http://archives.postgresql.org/pgsql-hackers/2011-01/msg00419.php

I have not received any response since then.

-Sushant.

On Tue, Feb 15, 2011 at 9:33 AM, David E. Wheeler da...@kineticode.com wrote:
> On Feb 14, 2011, at 3:57 PM, Tom Lane wrote:
>> There is zero, none, nada, provision for modifying the behavior of
>> the default parser, other than by changing its compiled-in state
>> transition tables. It doesn't help any that said tables are baroquely
>> designed and utterly undocumented. IMO, sooner or later we need to
>> trash that code and replace it with something a bit more
>> modification-friendly.
>
> I was afraid you'd say that.
>
> Thanks,
> David

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers