Hi,

We've done a lot of tests with the HyphenationCompoundWordTokenFilter, using a FOP XML file generated from TeX for the Dutch language, and have seen decent results. A bonus is that some tokens can now be stemmed properly, since not all compounds are listed in the dictionary for the HunspellStemFilter.
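
For anyone new to the idea, here is a rough sketch of what decompounding does to a token. It is plain Python, not Lucene code, and it uses a simple dictionary lookup instead of hyphenation patterns. The function name `decompound`, its parameters and the tiny word list are made up for illustration; only the notion of a minimum subword size comes from the filter.

```python
def decompound(token, dictionary, min_subword_size=4, max_subword_size=15):
    """Return the original token followed by every dictionary word found inside it.

    Hypothetical helper for illustration only; this is not how Lucene
    implements its compound word filters.
    """
    lowered = token.lower()
    parts = [token]  # always keep the original token
    for start in range(len(lowered)):
        for size in range(min_subword_size, max_subword_size + 1):
            end = start + size
            if end > len(lowered):
                break  # longer candidates would run past the end of the token
            candidate = lowered[start:end]
            if candidate in dictionary:
                parts.append(candidate)
    return parts


# Assumed two-word dictionary, using the example from Michael's question.
words = {"wind", "jacke"}
print(decompound("Windjacke", words))  # ['Windjacke', 'wind', 'jacke']
```

With min_subword_size=5 the same call would drop "wind" and keep only "jacke", which is the kind of boundary worth checking when tuning the real filter.
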

The filter does introduce a recall/precision problem, but it at least returns results for the many users who do not properly use compounds in their search queries.

There seems to be a small issue with the filter, where minSubwordSize=N yields subwords of size N-1.

Cheers,

On Thursday 12 April 2012 12:39:44 Paul Libbrecht wrote:
> Michael,
>
> I've been on this list and the lucene list for several years and have not
> found this yet. It's been one of the "neglected topics", to my taste.
>
> There is a CompoundAnalyzer but it requires the compounds to be dictionary
> based, as you indicate.
>
> I am convinced there's a way to build the de-compounding words efficiently
> from a broad corpus but I have never seen it (and the experts at DFKI I
> asked also told me they didn't know of one).
>
> paul
>
> On 12 Apr 2012 at 11:52, Michael Ludwig wrote:
> > Given an input of "Windjacke" (probably "wind jacket" in English), I'd
> > like the code that prepares the data for the index (tokenizer etc) to
> > understand that this is a "Jacke" ("jacket") so that a query for "Jacke"
> > would include the "Windjacke" document in its result set.
> >
> > It appears to me that such an analysis requires a dictionary-backed
> > approach, which doesn't have to be perfect at all; a list of the most
> > common 2000 words would probably do the job and fulfil a criterion of
> > reasonable usefulness.
> >
> > Do you know of any implementation techniques or working implementations
> > to do this kind of lexical analysis for German language data? (Or other
> > languages, for that matter?) What are they, where can I find them?
> >
> > I'm sure there is something out there (commercial or free) because I've
> > seen lots of engines grokking German and the way it builds words.
> >
> > Failing that, what are the proper terms to refer to these techniques so
> > you can search more successfully?
> >
> > Michael

--
Markus Jelsma - CTO - Openindex