AW: Lexical analysis tools for German language data

Michael Ludwig Thu, 12 Apr 2012 06:43:32 -0700

> From: Valeriy Felberg

> If you want that query "jacke" matches a document containing the word
> "windjacke" or "kinderjacke", you could use a custom update processor.
> This processor could search the indexed text for words matching the
> pattern ".*jacke" and inject the word "jacke" into an additional field
> which you can search against. You would need a whole list of possible
> suffixes, of course.


Thanks, Valeriy - I agree on the feasibility of such an approach. The
list would likely have to be composed of the most frequently used terms
for your specific domain.
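
To make the idea concrete, here is a minimal sketch of the enrichment step
in plain Python, independent of Solr. It is not an actual update processor;
the term list, the field names "description" and "base_terms", and both
function names are assumptions made up for illustration.

```python
import re

# Hypothetical domain list of base terms (an assumption, not from this thread).
BASE_TERMS = ["jacke", "hose", "schuh"]


def base_terms_for(text, base_terms=BASE_TERMS):
    """Return the base terms that some word in `text` ends with.

    Mirrors the ".*jacke" pattern from the quoted suggestion: a word
    matches a base term if it ends with that term (case-insensitive).
    """
    found = []
    for token in re.findall(r"\w+", text.lower()):
        for term in base_terms:
            if token.endswith(term) and term not in found:
                found.append(term)
    return found


def enrich(doc, source_field="description", target_field="base_terms"):
    """Return a copy of `doc` with the matched base terms in an extra field."""
    enriched = dict(doc)
    enriched[target_field] = base_terms_for(doc.get(source_field, ""))
    return enriched


doc = enrich({"id": "1", "description": "Rote Windjacke und Kinderjacke"})
```

A query for "jacke" against the extra field would then match this document,
while the original description stays untouched.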

In our case, it's things people would buy in shops. Reducing overly
complicated and convoluted product descriptions to proper basic terms -
that would do the job. It's like going to a restaurant boasting fancy
and unintelligible names for the dishes you may order when they are
really just ordinary stuff like pork and potatoes.

Thinking some more about it, giving sufficient boost to the attached
category data might also do the job. That would shift the burden of
supplying proper semantics to the guys doing the categorization.
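
As a rough sketch of the boosting idea, the request parameters could be built
like this. It assumes the edismax query parser with its "qf" parameter; the
field names "description" and "category", the boost value, and the helper
name are placeholders, not taken from our actual schema.

```python
from urllib.parse import urlencode


def build_query_params(user_query, category_boost=5.0):
    """Build request parameters that weight category matches more heavily.

    "description" and "category" are hypothetical field names; the boost
    value is a starting point to be tuned against real data.
    """
    return {
        "q": user_query,
        "defType": "edismax",
        # Search both fields, but score hits in the category field higher.
        "qf": f"description category^{category_boost}",
    }


params = build_query_params("Jacke")
query_string = urlencode(params)
```

The resulting query string would be appended to the select handler URL of
whatever core is in use.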

> It would slow down the update process but you don't need to split
> words during search.

> > On 12 Apr 2012, at 11:52, Michael Ludwig wrote:
> >
> >> Given an input of "Windjacke" (probably "wind jacket" in English),
> >> I'd like the code that prepares the data for the index (tokenizer
> >> etc) to understand that this is a "Jacke" ("jacket") so that a
> >> query for "Jacke" would include the "Windjacke" document in its
> >> result set.

A query for "Windjacke" or "Kinderjacke" would probably not have to be
de-specialized to "Jacke" because, well, that's the user input and users
looking for specific things are probably doing so for a reason. If no
matches are found you can still tell them to just broaden their search.

Michael
