Hi,

We've done a lot of tests with the HyphenationCompoundWordTokenFilter, using a 
FOP XML hyphenation file generated from TeX for the Dutch language, and have 
seen decent results. A bonus was that some tokens can now be stemmed properly, 
because not all compounds are listed in the dictionary for the HunspellStemFilter.

It does introduce a recall/precision trade-off, but at least it returns results 
for the many users who do not write compounds properly in their search queries.

There seems to be a small issue with the filter, where minSubwordSize=N yields 
subwords of size N-1.
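
To make clear what I would expect from minSubwordSize, here is a minimal sketch 
in plain Java. It is not the Lucene code and it does not use hyphenation 
patterns: it simply looks up every substring in a small word list, with the 
minimum size treated as an inclusive lower bound. The class name 
DecompoundSketch, the method subwords and the three-word dictionary are made up 
for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class DecompoundSketch {

    /**
     * Returns every dictionary word found inside the given word whose length
     * lies between minSubwordSize and maxSubwordSize, both bounds inclusive.
     * Brute-force lookup; hypothetical helper, not a Lucene API.
     */
    static List<String> subwords(String word, Set<String> dictionary,
                                 int minSubwordSize, int maxSubwordSize) {
        String lower = word.toLowerCase(Locale.ROOT);
        List<String> found = new ArrayList<>();
        for (int start = 0; start < lower.length(); start++) {
            // len starts AT minSubwordSize, so nothing shorter is ever emitted
            for (int len = minSubwordSize;
                 len <= maxSubwordSize && start + len <= lower.length();
                 len++) {
                String candidate = lower.substring(start, start + len);
                if (dictionary.contains(candidate)) {
                    found.add(candidate);
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Tiny illustrative word list; a real one would be far larger.
        Set<String> dict = new HashSet<>(Arrays.asList("wind", "jacke", "ja"));
        System.out.println(subwords("Windjacke", dict, 4, 15)); // [wind, jacke]
        System.out.println(subwords("Windjacke", dict, 2, 15)); // [wind, ja, jacke]
    }
}
```

With a minimum of 4 the two-letter entry "ja" is never returned. A filter that 
emitted subwords of size N-1 would behave as if the lower bound in the inner 
loop started one below the configured value.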

Cheers,

On Thursday 12 April 2012 12:39:44 Paul Libbrecht wrote:
> Michael,
> 
> I've been on this list and the lucene list for several years and have not
> found this yet. It's been one of the "neglected topics" to my taste.
> 
> There is a CompoundAnalyzer but it requires the compounds to be dictionary
> based, as you indicate.
> 
> I am convinced there's a way to build the de-compounding words efficiently
> from a broad corpus but I have never seen it (and the experts at DFKI I
> asked also told me they didn't know of one).
> 
> paul
> 
> On 12 Apr 2012 at 11:52, Michael Ludwig wrote:
> > Given an input of "Windjacke" (probably "wind jacket" in English), I'd
> > like the code that prepares the data for the index (tokenizer etc) to
> > understand that this is a "Jacke" ("jacket") so that a query for "Jacke"
> > would include the "Windjacke" document in its result set.
> > 
> > It appears to me that such an analysis requires a dictionary-backed
> > approach, which doesn't have to be perfect at all; a list of the most
> > common 2000 words would probably do the job and fulfil a criterion of
> > reasonable usefulness.
> > 
> > Do you know of any implementation techniques or working implementations
> > to do this kind of lexical analysis for German language data? (Or other
> > languages, for that matter?) What are they, where can I find them?
> > 
> > I'm sure there is something out (commercial or free) because I've seen
> > lots of engines grokking German and the way it builds words.
> > 
> > Failing that, what are the proper terms to refer to these techniques so
> > you can search more successfully?
> > 
> > Michael

-- 
Markus Jelsma - CTO - Openindex
