Re: AW: Lexical analysis tools for German language data

Markus Jelsma Thu, 12 Apr 2012 09:14:46 -0700

On Thursday 12 April 2012 18:00:14 Paul Libbrecht wrote:
> On 12 Apr 2012, at 17:46, Michael Ludwig wrote:
> >> Some compounds probably should not be decompounded, like "Fahrrad"
> >> (fahren/Rad). With a dictionary-based stemmer, you might decide to
> >> avoid decompounding for words in the dictionary.
> > 
> > Good point.
> 
> More or less, Fahrrad is generally abbreviated as Rad.
> (even though Rad can mean wheel and bike)
> 
> >> Note that highlighting gets pretty weird when you are matching only
> >> part of a word.
> > 
> > Guess it'll be weird when you get it wrong, like "Noten" in
> > "Notentriegelung".
> 
> This decomposition should not happen because Noten-triegelung does not have
> a correct second term.
> 
> >> The Basis Technology linguistic analyzers aren't cheap or small, but
> >> they work well.
> > 
> > We will consider our needs and options. Thanks for your thoughts.
> 
> My question remains as to which domain it aims at covering.
> We had such a need for mathematics texts... I would be pleasantly surprised
> if, for example, Differenzen-quotient would be decompounded.


The HyphenationCompoundWordTokenFilter can do those things, but those words
must be listed in the dictionary or you'll get strange results. Even then, it
still yields strange results when it emits tokens that are subwords of a
subword.
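To illustrate the subword-of-a-subword problem, here is a naive sketch in
Python. It is not how the HyphenationCompoundWordTokenFilter works internally;
it simply scans for every dictionary entry contained in a token. The function
name decompound, the protected set and the tiny dictionary are all made up for
illustration.

```python
def decompound(token, dictionary, protected=frozenset(), min_subword=3):
    """Return every dictionary word found inside token (naive substring scan).

    Hypothetical helper for illustration only, not a Lucene/Solr API.
    """
    word = token.lower()
    # Words listed as protected (e.g. "fahrrad") are never split.
    if word in protected:
        return []
    found = []
    for start in range(len(word)):
        for end in range(start + min_subword, len(word) + 1):
            part = word[start:end]
            # Skip the full token itself; keep any proper part in the dictionary.
            if part != word and part in dictionary:
                found.append(part)
    return found


# Toy dictionary, assumed for this example only.
dictionary = {"not", "noten", "entriegelung", "riegel", "fahr", "rad"}

print(decompound("Notentriegelung", dictionary))
print(decompound("Fahrrad", dictionary))
print(decompound("Fahrrad", dictionary, protected={"fahrrad"}))
```

With this toy dictionary, "Notentriegelung" yields the wrong split "noten" as
well as "riegel", which is a subword of the subword "entriegelung". Putting
"fahrrad" in the protected set keeps it from being split into fahr/rad.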

> 
> paul

-- 
Markus Jelsma - CTO - Openindex
