Re: [HACKERS] Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords
Tom Lane wrote:
> Bruce Momjian writes:
> > Added to TODO:
> >
> > * Allow text search dictionary to filter out only stop words
> >   http://archives.postgresql.org/pgsql-patches/2007-11/msg00081.php
>
> That's a poor description. I thought the TODO was something more like
> "allow dictionaries to change the token that is passed on to later
> dictionaries".

TODO updated as described.

--
Bruce Momjian http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches
Re: [HACKERS] Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords
Bruce Momjian writes:
> Added to TODO:
> * Allow text search dictionary to filter out only stop words
>   http://archives.postgresql.org/pgsql-patches/2007-11/msg00081.php

That's a poor description. I thought the TODO was something more like "allow dictionaries to change the token that is passed on to later dictionaries".

regards, tom lane
Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords
Added to TODO:

* Allow text search dictionary to filter out only stop words
  http://archives.postgresql.org/pgsql-patches/2007-11/msg00081.php

---

Tom Lane wrote:
> Oleg Bartunov writes:
> > Let's consider one example - removing accents.
> > In the past I always recommended that people use regex functions before
> > the to_tsvector conversion to remove accents, but recently I noticed that
> > this trick doesn't work with headline(). So the only way is to have a
> > special dictionary, dict_remove_accent, placed first, which works as a filter.
> > I don't remember why we left this for future releases, though.
>
> That would require a system-to-dictionary API change (to be able to
> modify the token under inspection), no? So it's certainly something
> I'd say is too late for 8.3.
>
> One thought that came to mind is that the option name should be just
> "Accept" not "AcceptAll". To me "All" implies that it would accept
> *everything* ... including stopwords.
>
> regards, tom lane

--
Bruce Momjian
EnterpriseDB http://postgres.enterprisedb.com
Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords
Oleg Bartunov writes:
> On Wed, 14 Nov 2007, Tom Lane wrote:
>> Huh? This is just an option for the "simple" dictionary, it's got
>> nothing to do with thesaurus AFAICS.
> I can assign the simple dictionary as a normalization dictionary for thesaurus.

Sure. So what? You wouldn't use this option in that case.

regards, tom lane
Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords
On Wed, 14 Nov 2007, Tom Lane wrote:
> Oleg Bartunov writes:
>> On Wed, 14 Nov 2007, Tom Lane wrote:
>>> Huh? This is just an option for the "simple" dictionary, it's got
>>> nothing to do with thesaurus AFAICS.
>> I can assign the simple dictionary as a normalization dictionary for thesaurus.
>
> Sure. So what? You wouldn't use this option in that case.

Right. That should be documented to avoid possible confusion.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords
On Wed, 14 Nov 2007, Tom Lane wrote:
> Oleg Bartunov writes:
>> On Wed, 14 Nov 2007, Tom Lane wrote:
>>> One thought that came to mind is that the option name should be just
>>> "Accept" not "AcceptAll". To me "All" implies that it would accept
>>> *everything* ... including stopwords.
>> Wait, I recall the problem with filters. How will it work with thesaurus?
>
> Huh? This is just an option for the "simple" dictionary, it's got
> nothing to do with thesaurus AFAICS.

I can assign the simple dictionary as a normalization dictionary for thesaurus.

Regards,
Oleg
Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords
Jan Urbański writes:
>> This bit should be replaced with defGetBoolean. Otherwise it looks
>> reasonably sane.
> Fixed that, thank you.

Applied with minor revisions (changed the parameter name, avoided probably-insignificant memory leak).

regards, tom lane
Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords
Oleg Bartunov writes:
> On Wed, 14 Nov 2007, Tom Lane wrote:
>> One thought that came to mind is that the option name should be just
>> "Accept" not "AcceptAll". To me "All" implies that it would accept
>> *everything* ... including stopwords.
> Wait, I recall the problem with filters. How will it work with thesaurus?

Huh? This is just an option for the "simple" dictionary, it's got nothing to do with thesaurus AFAICS.

regards, tom lane
Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords
On Wed, 14 Nov 2007, Tom Lane wrote:
> Oleg Bartunov writes:
>> Let's consider one example - removing accents.
>> In the past I always recommended that people use regex functions before
>> the to_tsvector conversion to remove accents, but recently I noticed that
>> this trick doesn't work with headline(). So the only way is to have a
>> special dictionary, dict_remove_accent, placed first, which works as a filter.
>> I don't remember why we left this for future releases, though.
>
> That would require a system-to-dictionary API change (to be able to
> modify the token under inspection), no? So it's certainly something

It requires one reserved option for dictionaries and the ability to get a dictionary option. Unless somebody has a dictionary with the same option, this change looks harmless.

> I'd say is too late for 8.3.

Yes, probably we will get a better idea.

> One thought that came to mind is that the option name should be just
> "Accept" not "AcceptAll". To me "All" implies that it would accept
> *everything* ... including stopwords.

Wait, I recall the problem with filters. How will it work with thesaurus?

Regards,
Oleg
Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords
Oleg Bartunov writes:
> Let's consider one example - removing accents.
> In the past I always recommended that people use regex functions before
> the to_tsvector conversion to remove accents, but recently I noticed that
> this trick doesn't work with headline(). So the only way is to have a
> special dictionary, dict_remove_accent, placed first, which works as a filter.
> I don't remember why we left this for future releases, though.

That would require a system-to-dictionary API change (to be able to modify the token under inspection), no? So it's certainly something I'd say is too late for 8.3.

One thought that came to mind is that the option name should be just "Accept" not "AcceptAll". To me "All" implies that it would accept *everything* ... including stopwords.

regards, tom lane
Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords
In principle, the right way is to allow any dictionary to have an option like 'PassThrough', plus an internal function get_dict_options(dict, option) to check whether the PassThrough option is true.

Let's consider one example - removing accents. In the past I always recommended that people use regex functions before the to_tsvector conversion to remove accents, but recently I noticed that this trick doesn't work with headline(). So the only way is to have a special dictionary, dict_remove_accent, placed first, which works as a filter. I don't remember why we left this for future releases, though.

Oleg

On Wed, 14 Nov 2007, Tom Lane wrote:
> This patch:
> http://archives.postgresql.org/pgsql-patches/2007-11/msg00137.php
> seems simple and useful enough that I think we ought to slip it into
> 8.3, even though we are far past feature freeze.
>
> As the "simple" dictionary type stands in CVS HEAD, it is only useful as
> the last dictionary in a stack, since it never passes anything on as
> unrecognized. With the proposed AcceptAll = false option, it could be
> used to filter out some stopwords before feeding tokens to another
> dictionary. While most dictionary types have their own stopword support,
> some of them match stopwords after their own normalization processing,
> and so there's no way to filter on pre-normalized words. That seems
> like a good improvement, even without the specific need-example that
> Jan provided at the start of the thread.
>
> Normally we'd never consider adding a new feature so late in the
> development cycle, but this seems small enough and useful enough
> to make an exception. Comments?
>
> regards, tom lane

Regards,
Oleg
Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords
Tom Lane wrote:
> This patch:
> http://archives.postgresql.org/pgsql-patches/2007-11/msg00137.php
> seems simple and useful enough that I think we ought to slip it into
> 8.3, even though we are far past feature freeze.
>
> As the "simple" dictionary type stands in CVS HEAD, it is only useful as
> the last dictionary in a stack, since it never passes anything on as
> unrecognized. With the proposed AcceptAll = false option, it could be
> used to filter out some stopwords before feeding tokens to another
> dictionary. While most dictionary types have their own stopword support,
> some of them match stopwords after their own normalization processing,
> and so there's no way to filter on pre-normalized words. That seems
> like a good improvement, even without the specific need-example that
> Jan provided at the start of the thread.
>
> Normally we'd never consider adding a new feature so late in the
> development cycle, but this seems small enough and useful enough
> to make an exception. Comments?

Agreed. The logic is that textsearch is getting a major overhaul in 8.3 and it is reasonable to keep adjusting things during beta.

--
Bruce Momjian
EnterpriseDB http://postgres.enterprisedb.com
Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords
This patch:
http://archives.postgresql.org/pgsql-patches/2007-11/msg00137.php
seems simple and useful enough that I think we ought to slip it into 8.3, even though we are far past feature freeze.

As the "simple" dictionary type stands in CVS HEAD, it is only useful as the last dictionary in a stack, since it never passes anything on as unrecognized. With the proposed AcceptAll = false option, it could be used to filter out some stopwords before feeding tokens to another dictionary. While most dictionary types have their own stopword support, some of them match stopwords after their own normalization processing, and so there's no way to filter on pre-normalized words. That seems like a good improvement, even without the specific need-example that Jan provided at the start of the thread.

Normally we'd never consider adding a new feature so late in the development cycle, but this seems small enough and useful enough to make an exception. Comments?

regards, tom lane
Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords
> This bit should be replaced with defGetBoolean. Otherwise it looks
> reasonably sane.

Fixed that, thank you.

Regards,
Jan Urbanski
--
Jan Urbanski
GPG key ID: E583D7D2
ouden estin

diff -Naur postgresql-8.3beta2-orig/doc/src/sgml/textsearch.sgml postgresql-8.3beta2/doc/src/sgml/textsearch.sgml
--- postgresql-8.3beta2-orig/doc/src/sgml/textsearch.sgml 2007-10-27 02:19:45.0 +0200
+++ postgresql-8.3beta2/doc/src/sgml/textsearch.sgml 2007-11-14 03:35:48.0 +0100
@@ -2090,9 +2090,10 @@
 The simple dictionary template operates by converting the
 input token to lower case and checking it against a file of stop words.
-If it is found in the file then NULL is returned, causing
-the token to be discarded. If not, the lower-cased form of the word
-is returned as the normalized lexeme.
+If it is found in the file then an empty array is returned. If not, the
+return value depends on the configuration. The default is to return the
+lower-cased form of the word, but one might choose to
+return NULL instead.
@@ -2135,6 +2136,34 @@
+
+ We can also choose to return NULL instead of the lower-cased
+ lexeme if it is not found in the stop words file. This can be useful if
+ we just want to pass the unchanged lexeme to another dictionary instead
+ of reporting it as recognized. We can control this behaviour through
+ the AcceptAll parameter. Correct values for this parameter
+ are true and false, the default
+ is true.
+
+
+
+ Using the same configuration as in the previous example:
+
+
+ALTER TEXT SEARCH DICTIONARY public.simple_dict ( AcceptAll = false );
+
+SELECT ts_lexize('public.simple_dict','YeS');
+ ts_lexize
+---
+
+
+SELECT ts_lexize('public.simple_dict','The');
+ ts_lexize
+---
+ {}
+
+
+
 Most types of dictionaries rely on configuration files, such as files of
diff -Naur postgresql-8.3beta2-orig/src/backend/tsearch/dict_simple.c postgresql-8.3beta2/src/backend/tsearch/dict_simple.c
--- postgresql-8.3beta2-orig/src/backend/tsearch/dict_simple.c 2007-08-25 02:03:59.0 +0200
+++ postgresql-8.3beta2/src/backend/tsearch/dict_simple.c 2007-11-14 12:17:05.0 +0100
@@ -23,6 +23,7 @@
 typedef struct
 {
     StopList    stoplist;
+    bool        acceptAll;
 } DictSimple;
@@ -31,9 +32,12 @@
 {
     List       *dictoptions = (List *) PG_GETARG_POINTER(0);
     DictSimple *d = (DictSimple *) palloc0(sizeof(DictSimple));
-    bool        stoploaded = false;
+    bool        stoploaded = false,
+                acceptloaded = false;
     ListCell   *l;
 
+    d->acceptAll = true;
+
     foreach(l, dictoptions)
     {
         DefElem    *defel = (DefElem *) lfirst(l);
@@ -47,6 +51,18 @@
             readstoplist(defGetString(defel), &d->stoplist, lowerstr);
             stoploaded = true;
         }
+        else if (pg_strcasecmp("AcceptAll", defel->defname) == 0)
+        {
+            if (acceptloaded)
+                ereport(ERROR,
+                        (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                         errmsg("multiple AcceptAll parameters")));
+            if (defGetBoolean(defel))
+                d->acceptAll = true;
+            else
+                d->acceptAll = false;
+            acceptloaded = true;
+        }
         else
         {
             ereport(ERROR,
@@ -71,9 +87,18 @@
     txt = lowerstr_with_len(in, len);
 
     if (*txt == '\0' || searchstoplist(&(d->stoplist), txt))
+    {
         pfree(txt);
+        PG_RETURN_POINTER(res);
+    }
     else
-        res[0].lexeme = txt;
-
-    PG_RETURN_POINTER(res);
+    {
+        if (d->acceptAll)
+        {
+            res[0].lexeme = txt;
+            PG_RETURN_POINTER(res);
+        }
+        else
+            PG_RETURN_POINTER(NULL);
+    }
 }
Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords
Jan Urbański wrote:
> Great, I didn't know the API was that convenient in 8.3. I'll try
> posting a working patch for 8.3 during the weekend.

Here's the patch for 8.3beta2. As was suggested, I added a configuration parameter called AcceptAll to the 'simple' dictionary, so it can now work in two modes: either accept everything (the default) or recognize nothing (return NULL). Of course, stopwords are still weeded out in both modes.

The patch includes changes to the documentation (which was inconsistent, by the way: it stated that the 'simple' dictionary returns NULL for stopwords, when in fact it returns an empty array).

Regards,
Jan Urbanski
--
Jan Urbanski

diff -Naur postgresql-8.3beta2-orig/doc/src/sgml/textsearch.sgml postgresql-8.3beta2/doc/src/sgml/textsearch.sgml
--- postgresql-8.3beta2-orig/doc/src/sgml/textsearch.sgml 2007-10-27 02:19:45.0 +0200
+++ postgresql-8.3beta2/doc/src/sgml/textsearch.sgml 2007-11-14 03:35:48.0 +0100
@@ -2090,9 +2090,10 @@
 The simple dictionary template operates by converting the
 input token to lower case and checking it against a file of stop words.
-If it is found in the file then NULL is returned, causing
-the token to be discarded. If not, the lower-cased form of the word
-is returned as the normalized lexeme.
+If it is found in the file then an empty array is returned. If not, the
+return value depends on the configuration. The default is to return the
+lower-cased form of the word, but one might choose to
+return NULL instead.
@@ -2135,6 +2136,34 @@
+
+ We can also choose to return NULL instead of the lower-cased
+ lexeme if it is not found in the stop words file. This can be useful if
+ we just want to pass the unchanged lexeme to another dictionary instead
+ of reporting it as recognized. We can control this behaviour through
+ the AcceptAll parameter. Correct values for this parameter
+ are true and false, the default
+ is true.
+
+
+
+ Using the same configuration as in the previous example:
+
+
+ALTER TEXT SEARCH DICTIONARY public.simple_dict ( AcceptAll = false );
+
+SELECT ts_lexize('public.simple_dict','YeS');
+ ts_lexize
+---
+
+
+SELECT ts_lexize('public.simple_dict','The');
+ ts_lexize
+---
+ {}
+
+
+
 Most types of dictionaries rely on configuration files, such as files of
diff -Naur postgresql-8.3beta2-orig/src/backend/tsearch/dict_simple.c postgresql-8.3beta2/src/backend/tsearch/dict_simple.c
--- postgresql-8.3beta2-orig/src/backend/tsearch/dict_simple.c 2007-08-25 02:03:59.0 +0200
+++ postgresql-8.3beta2/src/backend/tsearch/dict_simple.c 2007-11-14 03:39:45.0 +0100
@@ -23,6 +23,7 @@
 typedef struct
 {
     StopList    stoplist;
+    bool        acceptAll;
 } DictSimple;
@@ -31,8 +32,12 @@
 {
     List       *dictoptions = (List *) PG_GETARG_POINTER(0);
     DictSimple *d = (DictSimple *) palloc0(sizeof(DictSimple));
-    bool        stoploaded = false;
+    bool        stoploaded = false,
+                acceptloaded = false;
     ListCell   *l;
+    const char *defstring;
+
+    d->acceptAll = true;
 
     foreach(l, dictoptions)
     {
@@ -47,6 +52,24 @@
             readstoplist(defGetString(defel), &d->stoplist, lowerstr);
             stoploaded = true;
         }
+        else if (pg_strcasecmp("AcceptAll", defel->defname) == 0)
+        {
+            if (acceptloaded)
+                ereport(ERROR,
+                        (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                         errmsg("multiple AcceptAll parameters")));
+            defstring = defGetString(defel);
+            if (pg_strcasecmp(defstring, "True") == 0)
+                d->acceptAll = true;
+            else if (pg_strcasecmp(defstring, "False") == 0)
+                d->acceptAll = false;
+            else
+                ereport(ERROR,
+                        (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                         errmsg("invalid value for AcceptAll parameter: \"%s\"",
+                                defstring)));
+            acceptloaded = true;
+        }
         else
         {
             ereport(ERROR,
@@ -71,9 +94,18 @@
     txt = lowerstr_with_len(in, len);
 
     if (*txt == '\0' || searchstoplist(&(d->stoplist), txt))
+    {
         pfree(txt);
+        PG_RETURN_POINTER(res);
+    }
     else
-        res[0].lexeme = txt;
-
-    PG_RETURN_POINTER(res);
+    {
+        if (d->acceptAll)
+        {
+
Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords
> +            defstring = defGetString(defel);
> +            if (pg_strcasecmp(defstring, "True") == 0)
> +                d->acceptAll = true;
> +            else if (pg_strcasecmp(defstring, "False") == 0)
> +                d->acceptAll = false;
> +            else
> +                ereport(ERROR,
> +                        (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +                         errmsg("invalid value for AcceptAll parameter: \"%s\"",
> +                                defstring)));

This bit should be replaced with defGetBoolean. Otherwise it looks reasonably sane.

regards, tom lane
Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords
> That doesn't have a whole lot to do with where we are today:
> http://developer.postgresql.org/pgdocs/postgres/textsearch-dictionaries.html#TEXTSEARCH-SIMPLE-DICTIONARY
> http://developer.postgresql.org/cvsweb.cgi/pgsql/src/backend/tsearch/dict_simple.c

Great, I didn't know the API was that convenient in 8.3. I'll try posting a working patch for 8.3 during the weekend.

Regards,
--
Jan Urbanski
Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords
Jan Urbański writes:
>> If there is a use-case for it, IMHO it'd be better to add a boolean
>> accept-or-pass-on parameter to the "simple" dictionary than to add a
>> whole new dictionary type.
> Ah, I never thought of it. You may be very right - it does look like an
> easier solution. However, it would require coding some basic parsing
> logic into the dex_init procedure, because right now the 'simple'
> dictionary expects dict_initoption to be a path to the stopwords file.

That doesn't have a whole lot to do with where we are today:
http://developer.postgresql.org/pgdocs/postgres/textsearch-dictionaries.html#TEXTSEARCH-SIMPLE-DICTIONARY
http://developer.postgresql.org/cvsweb.cgi/pgsql/src/backend/tsearch/dict_simple.c

regards, tom lane
Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords
> This example still doesn't seem very convincing --- why would you not
> merely attach the stopword list to the pl_ispell dictionary?

Because the ispell-based dictionaries first stem the lexeme and then search for it in the stopwords file. The situation here is that a stopword is first stemmed to produce another lexeme (which is not in the stopwords file, as it's a perfectly valid word) and then gets indexed, instead of being discarded.

To restate: the word 'od' in Polish is both a preposition and a declined form of the noun 'oda'. The ispell dictionary, when passed the lexeme 'od', first stems it to produce 'oda' and then fails to find it in the stopwords file. If I included the word 'oda' in the stopwords file, I'd be losing information about the noun 'oda' appearing in documents.

I'm still trying to find an English example, as I'm sure it would be easier for most readers of this list to understand. Nothing comes to my mind, however - I guess some languages just have rotten luck with their grammar.

> If there is a use-case for it, IMHO it'd be better to add a boolean
> accept-or-pass-on parameter to the "simple" dictionary than to add a
> whole new dictionary type.

Ah, I never thought of it. You may be very right - it does look like an easier solution. However, it would require coding some basic parsing logic into the dex_init procedure, because right now the 'simple' dictionary expects dict_initoption to be a path to the stopwords file.

Do you mean something like 'StopFile="/path/to/stopwords", AcceptUnknown=0'?

Regards,
Jan Urbanski
--
Jan Urbanski
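[Editorial illustration] The 'od'/'oda' problem described in the message above can be reproduced with a small Python model. This is only a sketch under stated assumptions: the STEMS table and the stem_then_filter function are hypothetical stand-ins for an ispell dictionary, not PostgreSQL code, and the only behaviour modelled is the "stem first, then check the stop word file" order that the message describes.

```python
from typing import Dict, List, Optional, Set

# Toy form -> stem table. The entries are illustrative assumptions, not data
# taken from a real Polish ispell dictionary.
STEMS: Dict[str, str] = {"od": "oda", "oda": "oda"}


def stem_then_filter(token: str, stopwords: Set[str]) -> Optional[List[str]]:
    """Stem the token first, then look the *stem* up in the stop word list."""
    stem = STEMS.get(token.lower())
    if stem is None:
        return None   # not recognized: pass the token on to the next dictionary
    if stem in stopwords:
        return []     # stop word: discard the token
    return [stem]     # recognized: index the stem


# The stop list holds only the preposition. 'od' is stemmed to 'oda' before the
# lookup, so the stop word slips through and gets indexed.
print(stem_then_filter("od", {"od"}))          # ['oda']

# Adding 'oda' to the stop list does discard the preposition, but it also
# discards every genuine occurrence of the noun.
print(stem_then_filter("oda", {"od", "oda"}))  # []
```

Neither stop list gives the wanted result, which is why the filtering has to happen before stemming, in a separate dictionary placed earlier in the chain.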
Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords
> dictionaries. In this case, you would first check against one stopword
> list, eliminating 'od', then check the ispell dictionary, and then check
> another stopword list without 'od'.

My problem is basically solved using the patch I sent earlier. I use '{stop, pl_ispell, simple}' which has the effect of:
a) eliminating words that are stopwords but, when stemmed, produce non-stopwords (such as 'od', which gets stemmed to 'oda')
b) stemming non-stopwords properly (using an ispell dictionary)
c) indexing words that are not recognized by ispell (for instance 'postgresql' gets indexed as 'postgresql')

> I suggested that a while ago
> (http://archives.postgresql.org/pgsql-hackers/2007-08/msg01036.php).
> Hopefully Oleg or someone else gets around restructuring the
> dictionaries in a future release.

I'm glad to see I'm not the only one who needs a more sophisticated way of dealing with dictionary chaining. I understand, however, the problems that arise when one wants to extend the dictionary API beyond the reject/accept/pass-on scheme. For these three we have an easy way of passing the result from lexize - it returns an empty array, an array of stemmed lexemes, or NULL. If more complex actions were to be taken, I'm afraid lexize would have to return something more complex than just text[].

> I wonder if you could hack the ispell dictionary file to treat oda
> specially?

I thought about it, but it turned out that writing a custom dictionary was easier than figuring out how ispell works internally.

Regards,
--
Jan Urbanski
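[Editorial illustration] The reject/accept/pass-on scheme and the '{stop, pl_ispell, simple}' chain from the message above can be modelled in a few lines of Python. This is a sketch, not PostgreSQL's implementation: make_simple, make_toy_stemmer and run_chain are hypothetical names, the stemmer is a fixed lookup table standing in for ispell, and the `accept` flag mirrors the option discussed in this thread (AcceptAll in the patch). An empty list models the empty array, and None models NULL.

```python
from typing import Callable, Dict, Iterable, List, Optional

# A dictionary maps a token to one of three outcomes:
#   []         reject: the token is a stop word and is discarded
#   [lexemes]  accept: the token is recognized, index these lexemes
#   None       pass on: not recognized, try the next dictionary in the chain
Dictionary = Callable[[str], Optional[List[str]]]


def make_simple(stopwords: Iterable[str], accept: bool = True) -> Dictionary:
    """Model of the 'simple' dictionary with the accept option from the thread."""
    stop = {w.lower() for w in stopwords}

    def lexize(token: str) -> Optional[List[str]]:
        word = token.lower()
        if word in stop:
            return []
        return [word] if accept else None

    return lexize


def make_toy_stemmer(stems: Dict[str, str]) -> Dictionary:
    """Hypothetical stand-in for an ispell dictionary: a fixed form -> stem map."""

    def lexize(token: str) -> Optional[List[str]]:
        stem = stems.get(token.lower())
        return [stem] if stem is not None else None

    return lexize


def run_chain(chain: List[Dictionary], token: str) -> List[str]:
    """Consult dictionaries in order; the first non-None answer wins."""
    for dictionary in chain:
        result = dictionary(token)
        if result is not None:
            return result
    return []  # no dictionary recognized the token, so it is dropped


chain = [
    make_simple(["od"], accept=False),            # 'stop': filter only
    make_toy_stemmer({"od": "oda", "oda": "oda"}),  # 'pl_ispell' stand-in
    make_simple([], accept=True),                 # 'simple': catch-all
]

print(run_chain(chain, "od"))          # []
print(run_chain(chain, "oda"))         # ['oda']
print(run_chain(chain, "PostgreSQL"))  # ['postgresql']
```

With accept=True on the first dictionary, every non-stopword would be accepted there and the stemmer would never be consulted, which is the limitation the patch removes.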
Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords
Jan Urbański wrote:
>>> The solution I came up with was simple: write a dictionary, that does
>>> only one thing: looks up the lexeme in a stopwords file and either
>>> discards it or returns NULL.
>>
>> Doesn't the "simple" dictionary handle this?
>
> I don't think so. The 'simple' dictionary discards stopwords, but
> accepts any other lexemes. So if I use {'simple', 'pl_ispell'} for my
> config, I'll get rid of the stopwords, but I won't get any lexemes
> stemmed by ispell. Every lexeme that's not a stopword will produce the
> very same lexeme (this is how I think the 'simple' dictionary works).
>
> My dictionary does basically the same thing as the 'simple' dictionary,
> but it returns NULL instead of the original lexeme in case the lexeme is
> not found in the stopwords file.

In the long term, what we really need is a more flexible way to chain dictionaries. In this case, you would first check against one stopword list, eliminating 'od', then check the ispell dictionary, and then check another stopword list without 'od'.

I suggested that a while ago (http://archives.postgresql.org/pgsql-hackers/2007-08/msg01036.php). Hopefully Oleg or someone else gets around to restructuring the dictionaries in a future release.

I wonder if you could hack the ispell dictionary file to treat oda specially?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords
>> The solution I came up with was simple: write a dictionary, that does
>> only one thing: looks up the lexeme in a stopwords file and either
>> discards it or returns NULL.
>
> Doesn't the "simple" dictionary handle this?

I don't think so. The 'simple' dictionary discards stopwords, but accepts any other lexemes. So if I use {'simple', 'pl_ispell'} for my config, I'll get rid of the stopwords, but I won't get any lexemes stemmed by ispell. Every lexeme that's not a stopword will produce the very same lexeme (this is how I think the 'simple' dictionary works).

My dictionary does basically the same thing as the 'simple' dictionary, but it returns NULL instead of the original lexeme in case the lexeme is not found in the stopwords file.

Regards,
--
Jan Urbanski
Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords
Jan Urbański writes:
> The solution I came up with was simple: write a dictionary, that does
> only one thing: looks up the lexeme in a stopwords file and either
> discards it or returns NULL.

Doesn't the "simple" dictionary handle this?

regards, tom lane