Re: [HACKERS] Dictionary chaining and stop words

2007-08-29 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 There's clearly a need for transforming a word and passing on the
 transformed version to the next dictionary. dict_thesaurus does exactly
 that by supporting a subdictionary which is called before invoking the
 thesaurus, but it should be a generic capability, not specific to any
 dictionary. Let's modify the lexize API so that a dictionary can:
 - Accept the word (and possibly replace it with something else)
 - Reject the word
 - Transform word into another (or pass on as is)

This doesn't seem to be enough to solve the thesaurus's problem, though.
The difficulty there is that (1) it wants to look at several words
at once, and (2) it wants to know which words were rejected as stopwords.
If filtering happens before the thesaurus sees the input, how can it do
either?
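
To make the difficulty concrete, here is a toy Python sketch. It is not
tsearch code: the stop word list, the pattern format (None standing for
"any stop word here"), and every function name are assumptions made up
for illustration. It only shows that once stop words are filtered out
early, a phrase matcher can no longer tell "king of spain" from
"king spain".

```python
# Toy model only: names and the pattern format are hypothetical, not tsearch APIs.
STOP_WORDS = {"of", "the", "a"}

def filter_first(words):
    """Drop stop words before the phrase matcher sees anything."""
    return [w for w in words if w not in STOP_WORDS]

def mark_stop_words(words):
    """Keep every position, recording which words are stop words."""
    return [(w, w in STOP_WORDS) for w in words]

def match_phrase(marked, pattern):
    """Match a phrase pattern; None in the pattern means 'any stop word here'."""
    if len(marked) != len(pattern):
        return False
    for (word, is_stop), expected in zip(marked, pattern):
        if expected is None:
            if not is_stop:
                return False
        elif is_stop or word != expected:
            return False
    return True

PATTERN = ["king", None, "spain"]
with_stop = ["king", "of", "spain"]
without_stop = ["king", "spain"]

# After early filtering the two inputs are indistinguishable.
print(filter_first(with_stop) == filter_first(without_stop))
# With stop word positions preserved, the matcher can tell them apart.
print(match_phrase(mark_stop_words(with_stop), PATTERN))
print(match_phrase(mark_stop_words(without_stop), PATTERN))
```

Under these assumptions, a phrase-aware dictionary needs the stop word
positions to survive whatever runs before it.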

regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Dictionary chaining and stop words

2007-08-29 Thread Oleg Bartunov

Heikki, we know about this (I call it filtering), but we are leaving it
for the future, after we have everything in core. A more demonstrative
example is the well-known accent-removal problem. I used to recommend
preprocessing the string before tsearch2, but that does not work with
headline(), so clearly we need accent removal in the dictionary chain,
using a simple pg_unaccent dictionary that returns the original word
without accents and then passes it on to the next dictionary. Currently,
this is impossible. But it is not obvious how this should work in the
general case, when a dictionary returns an array of lexemes. So we
decided to leave it for the future.
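
A rough Python sketch of the filtering idea, for illustration only. The
functions unaccent, simple_dict and lexize_with_filter are hypothetical
stand-ins, not the pg_unaccent dictionary or any tsearch API, and the
sketch assumes each dictionary returns exactly one lexeme, which sidesteps
the array-of-lexemes case mentioned above.

```python
import unicodedata

def unaccent(word):
    """Hypothetical filter step: strip combining accent marks from a word."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def simple_dict(word):
    """Stand-in for the next dictionary in the chain: just lowercases."""
    return [word.lower()]

def lexize_with_filter(word):
    """Run the filter, then hand the transformed word to the next dictionary."""
    return simple_dict(unaccent(word))

text = "Hôtel Crémieux"
lexemes = [lexize_with_filter(w)[0] for w in text.split()]

# The lexemes are unaccented, while the source text is left untouched,
# so a headline-style function could still display the original spelling.
print(lexemes)
print(text)
```

Because the accent removal happens inside the chain instead of on the
input string, the document text itself is never modified.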

I'm very pleased that we now have many developers interested in text
search development! We have many interesting TODO items, like 'phrase search'.

Oleg

On Wed, 29 Aug 2007, Heikki Linnakangas wrote:


It's nice to be able to chain tsearch dictionaries, but I find that it's
not as flexible as it should be. Currently we have these dictionaries
built-in:

dict_simple - lowercases and checks against stop word list, accepts
everything not in stop word list
dict_synonym - replaces with synonym, if found
dict_thesaurus - similar to synonym, but can recognize phrases
dict_ispell - lowercases, checks dictionary, then checks stop words
dict_snowball - lowercases, checks stop words, then stems

The way things are at the moment, you can't, for example, use any of the
built-in dictionaries in case-sensitive mode without writing custom C
code. Nor can you check against stop words before going through an ispell
dictionary (dict_simple accepts everything, so you can't put it in front
of dict_ispell), or use an ispell dictionary first and then replace
synonyms with dict_synonym, and so forth.

To make the chaining more useful, I'm proposing some changes to the
dictionary API and the set of built-in dictionaries. Currently, a
dictionary can either:
- Accept the word (and possibly replace it with something else)
- Reject the word
- Do nothing

There's clearly a need for transforming a word and passing on the
transformed version to the next dictionary. dict_thesaurus does exactly
that by supporting a subdictionary which is called before invoking the
thesaurus, but it should be a generic capability, not specific to any
dictionary. Let's modify the lexize API so that a dictionary can:
- Accept the word (and possibly replace it with something else)
- Reject the word
- Transform word into another (or pass on as is)
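
As a rough illustration of the three outcomes, here is a toy model in
Python. It is not the C lexize interface: the Outcome enum, the
(outcome, word) return shape, run_chain, and the three sample
dictionaries are all assumptions invented for this sketch, and each
dictionary is assumed to return at most one word.

```python
from enum import Enum

class Outcome(Enum):
    ACCEPT = 1     # word is final, stop the chain
    REJECT = 2     # word is discarded (e.g. a stop word), stop the chain
    TRANSFORM = 3  # hand the (possibly changed) word to the next dictionary

def lowercase(word):
    return Outcome.TRANSFORM, word.lower()

def stop_words(word):
    if word in {"the", "a"}:
        return Outcome.REJECT, None
    return Outcome.TRANSFORM, word

def synonym(word):
    table = {"pgsql": "postgresql"}
    if word in table:
        return Outcome.ACCEPT, table[word]
    return Outcome.TRANSFORM, word

def run_chain(chain, word):
    """Return [lexeme] if accepted, [] if rejected, None if nobody decided."""
    for lexize in chain:
        outcome, result = lexize(word)
        if outcome is Outcome.ACCEPT:
            return [result]
        if outcome is Outcome.REJECT:
            return []
        word = result  # TRANSFORM: the next dictionary sees the new word
    return None

chain = [lowercase, stop_words, synonym]
print(run_chain(chain, "PgSQL"))
print(run_chain(chain, "The"))
print(run_chain(chain, "Heikki"))
```

In this model a word that every dictionary merely passes on ends up
unrecognized (None), which is why a terminating dictionary that accepts
everything is useful at the end of a chain.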

If we do that, and modularize the lowercasing and stopwords
functionality into separate dictionaries, we end up with this nice,
orthogonal set of dictionaries that you can use as building blocks for a
wide range of more complex rules:

dict_lowercase - lowercases, doesn't accept or reject anything
dict_simple - accepts or rejects (depending on dict option) words in
list, passes on others. This can be used for stop words functionality,
or to accept words found in a simple list of words
dict_accept - accepts everything (for use as a terminator in the chain,
if you want to accept everything not accepted or rejected by other
dictionaries)

dict_synonym - replaces input with synonym, passes on or accepts matches
depending on dict option
dict_thesaurus - replaces input with preferred term, passes on or
accepts matches depending on dict option
dict_ispell - replaces input with basic form from dictionary, passes on
or accepts matches depending on dict option
dict_snowball - replaces input with stem, passes on
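
A toy Python sketch of how such building blocks might compose. The
dictionary names follow the list above, but the factory functions, the
option names (accept, accept_matches), the ("accept" / "reject" / "pass")
verdict strings, and run_chain are assumptions made up for this example,
not an existing or proposed C API.

```python
def dict_lowercase():
    return lambda word: ("pass", word.lower())

def dict_simple(words, accept):
    """Accept or reject words in the list (per option), pass on the rest."""
    def lexize(word):
        if word in words:
            return ("accept", word) if accept else ("reject", None)
        return ("pass", word)
    return lexize

def dict_synonym(table, accept_matches):
    """Replace input with its synonym; accept or pass on matches per option."""
    def lexize(word):
        if word in table:
            return ("accept" if accept_matches else "pass", table[word])
        return ("pass", word)
    return lexize

def dict_accept():
    return lambda word: ("accept", word)

def run_chain(chain, word):
    """Return [lexeme] if accepted, [] if rejected, None if nobody decided."""
    for lexize in chain:
        verdict, word = lexize(word)
        if verdict == "accept":
            return [word]
        if verdict == "reject":
            return []
    return None

stop = dict_simple({"the", "of"}, accept=False)
syn = dict_synonym({"pgsql": "postgresql"}, accept_matches=False)

# Same blocks, two configurations: the second simply omits lowercasing.
case_insensitive = [dict_lowercase(), stop, syn, dict_accept()]
case_sensitive = [stop, syn, dict_accept()]

print(run_chain(case_insensitive, "The"))
print(run_chain(case_sensitive, "The"))
print(run_chain(case_insensitive, "PgSQL"))
```

The point of the sketch is that case sensitivity and stop word handling
become a matter of which blocks are placed in the chain, with no custom
code per combination.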

I don't know what the current plan for beta is, but it would be nice to
get the API right even though there is some work to do. I can write a
patch if no-one objects.




Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
