subject:"Lexical analysis tools for German language data"

Re: Lexical analysis tools for German language data

2012-04-13 Thread Tomas Zerolo

On Thu, Apr 12, 2012 at 03:46:56PM +, Michael Ludwig wrote: From: Walter Underwood German noun decompounding is a little more complicated than it might seem. There can be transformations or inflections, like the s in Weihnachtsbaum (Weihnachten/Baum). I remember from my

AW: Lexical analysis tools for German language data

2012-04-13 Thread Michael Ludwig

From: Tomas Zerolo There can be transformations or inflections, like the s in Weihnachtsbaum (Weihnachten/Baum). I remember from my linguistics studies that the terminus technicus for these is Fugenmorphem (interstitial or joint morpheme) [...] IANAL (I am not a linguist -- pun

Lexical analysis tools for German language data

2012-04-12 Thread Michael Ludwig

Given an input of Windjacke (probably wind jacket in English), I'd like the code that prepares the data for the index (tokenizer etc.) to understand that this is a Jacke (jacket) so that a query for Jacke would include the Windjacke document in its result set. It appears to me that such an

AW: Lexical analysis tools for German language data

2012-04-12 Thread Michael Ludwig

Given an input of Windjacke (probably wind jacket in English), I'd like the code that prepares the data for the index (tokenizer etc.) to understand that this is a Jacke (jacket) so that a query for Jacke would include the Windjacke document in its result set. It appears to me that such an

Re: Lexical analysis tools for German language data

2012-04-12 Thread Paul Libbrecht

Michael, I've been on this list and the Lucene list for several years and have not found this yet. It's been a neglected topic, to my taste. There is a CompoundAnalyzer but it requires the compounds to be dictionary based, as you indicate. I am convinced there's a way to build the

Re: Lexical analysis tools for German language data

2012-04-12 Thread Bernd Fehling

You might have a look at: http://www.basistech.com/lucene/ On 12.04.2012 11:52, Michael Ludwig wrote: Given an input of Windjacke (probably wind jacket in English), I'd like the code that prepares the data for the index (tokenizer etc.) to understand that this is a Jacke (jacket) so that a

Re: Lexical analysis tools for German language data

2012-04-12 Thread Valeriy Felberg

If you want the query jacke to match a document containing the word windjacke or kinderjacke, you could use a custom update processor. This processor could search the indexed text for words matching the pattern .*jacke and inject the word jacke into an additional field which you can search
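To illustrate the suffix-matching idea described above, here is a minimal sketch in plain Python. It is not code from the thread and does not use the Solr update processor API; the names HEADS, inject_heads, process_document, and the field names text and text_heads are all hypothetical.

```python
import re

# Hypothetical list of head nouns that should be findable inside compounds.
HEADS = ["jacke"]


def inject_heads(text, heads=HEADS):
    """Return each head that ends some word in text (pattern .*<head>)."""
    found = []
    for word in re.findall(r"\w+", text.lower()):
        for head in heads:
            # word.endswith(head) is the per-word equivalent of .*jacke
            if word.endswith(head) and head not in found:
                found.append(head)
    return found


def process_document(doc, source_field="text", target_field="text_heads"):
    """Return a copy of doc with the matched heads in an extra field."""
    enriched = dict(doc)
    enriched[target_field] = inject_heads(doc.get(source_field, ""))
    return enriched


doc = process_document({"id": "1", "text": "Rote Windjacke und Kinderjacke"})
print(doc["text_heads"])
```

A query against the extra field for jacke would then find this document. Note that a pure suffix match knows nothing about word boundaries inside the compound, so it will also fire on words that merely happen to end in the same letters.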

Re: Lexical analysis tools for German language data

2012-04-12 Thread Paul Libbrecht

Bernd, can you please say a little more? I think it is OK for this list to carry some description of commercial solutions that satisfy a request formulated on the list. Is there any product at BASIS Tech that provides a compound-analyzer with a big dictionary of decomposed compounds in German? If yes,

Re: Lexical analysis tools for German language data

2012-04-12 Thread Bernd Fehling

Paul, nearly two years ago I requested an evaluation license and tested BASIS Tech Rosette for Lucene Solr. It worked excellently, but the price was much, much too high. Yes, they also have compound analysis for several languages including German. Just configure your pipeline in Solr and set up the

AW: Lexical analysis tools for German language data

2012-04-12 Thread Michael Ludwig

From: Valeriy Felberg If you want the query jacke to match a document containing the word windjacke or kinderjacke, you could use a custom update processor. This processor could search the indexed text for words matching the pattern .*jacke and inject the word jacke into an additional field

Re: Lexical analysis tools for German language data

2012-04-12 Thread Markus Jelsma

Hi, We've done a lot of tests with the HyphenationCompoundWordTokenFilter using an FOP XML file generated from TeX for the Dutch language and have seen decent results. A bonus was that now some tokens can be stemmed properly because not all compounds are listed in the dictionary for the

AW: Lexical analysis tools for German language data

2012-04-12 Thread Michael Ludwig

From: Markus Jelsma We've done a lot of tests with the HyphenationCompoundWordTokenFilter using an FOP XML file generated from TeX for the Dutch language and have seen decent results. A bonus was that now some tokens can be stemmed properly because not all compounds are listed in the

Re: Lexical analysis tools for German language data

2012-04-12 Thread Walter Underwood

German noun decompounding is a little more complicated than it might seem. There can be transformations or inflections, like the s in Weihnachtsbaum (Weihnachten/Baum). Internal nouns should be recapitalized, like Baum above. Some compounds probably should not be decompounded, like Fahrrad
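The three points above (linking elements, recapitalization, and compounds that should stay whole) can be sketched with a toy dictionary-based splitter. This is an illustration only, not code from the thread and not the Lucene compound filters; LEXICON, KEEP_WHOLE, candidates, and decompound are hypothetical names, and the lexicon is far too small for real use.

```python
# Tiny hypothetical lexicon; a real system would need a large dictionary.
LEXICON = {"wind", "jacke", "kinder", "weihnachten", "baum", "rad"}
# Compounds that are deliberately left unsplit (assumption for illustration).
KEEP_WHOLE = {"fahrrad"}


def candidates(left):
    """Possible lexicon forms for the left part, undoing a linking 's'."""
    forms = [left]
    if left.endswith("s"):
        forms.append(left[:-1])         # drop the linking s
        forms.append(left[:-1] + "en")  # weihnachts -> weihnachten
    return forms


def decompound(word):
    """Split word into two lexicon nouns, or return it unsplit."""
    lower = word.lower()
    if lower in KEEP_WHOLE:
        return [word]
    # Try every split point that leaves at least three letters on each side.
    for i in range(3, len(lower) - 2):
        left, right = lower[:i], lower[i:]
        if right not in LEXICON:
            continue
        for form in candidates(left):
            if form in LEXICON:
                # German nouns are capitalized, so recapitalize both parts.
                return [form.capitalize(), right.capitalize()]
    return [word]


for w in ("Weihnachtsbaum", "Windjacke", "Fahrrad"):
    print(w, decompound(w))
```

Only the linking s is handled here; German has other linking elements, and a splitter like this handles only two-part compounds.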

AW: Lexical analysis tools for German language data

2012-04-12 Thread Michael Ludwig

From: Walter Underwood German noun decompounding is a little more complicated than it might seem. There can be transformations or inflections, like the s in Weihnachtsbaum (Weihnachten/Baum). I remember from my linguistics studies that the terminus technicus for these is Fugenmorphem

Re: AW: Lexical analysis tools for German language data

2012-04-12 Thread Paul Libbrecht

On 12 Apr 2012 at 17:46, Michael Ludwig wrote: Some compounds probably should not be decompounded, like Fahrrad (fahren/Rad). With a dictionary-based stemmer, you might decide to avoid decompounding for words in the dictionary. Good point. More or less, Fahrrad is generally abbreviated

Re: AW: Lexical analysis tools for German language data

2012-04-12 Thread Walter Underwood

On Apr 12, 2012, at 8:46 AM, Michael Ludwig wrote: I remember from my linguistics studies that the terminus technicus for these is Fugenmorphem (interstitial or joint morpheme). That is some excellent linguistic jargon. I'll file that with hapax legomenon. If you don't highlight, you can get

Re: AW: Lexical analysis tools for German language data

2012-04-12 Thread Markus Jelsma

On Thursday 12 April 2012 18:00:14 Paul Libbrecht wrote: On 12 Apr 2012 at 17:46, Michael Ludwig wrote: Some compounds probably should not be decompounded, like Fahrrad (fahren/Rad). With a dictionary-based stemmer, you might decide to avoid decompounding for words in the dictionary.

Re: AW: Lexical analysis tools for German language data

2012-04-12 Thread Walter Underwood

On Apr 12, 2012, at 9:00 AM, Paul Libbrecht wrote: More or less, Fahrrad is generally abbreviated as Rad. (even though Rad can mean wheel and bike) A synonym could handle this, since fahren would not be a good match. It is a judgement call, but this seems more like an equivalence Fahrrad = Rad
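The equivalence Fahrrad = Rad mentioned above could be modelled as a synonym group rather than as a decompounding rule. A minimal sketch in plain Python follows; it is not taken from the thread and is not the Solr synonym filter, and SYNONYMS and expand are hypothetical names.

```python
# Hypothetical synonym groups; every term in a group matches the others.
SYNONYMS = [{"fahrrad", "rad"}]


def expand(term, groups=SYNONYMS):
    """Return all lowercased terms equivalent to term, sorted."""
    term = term.lower()
    for group in groups:
        if term in group:
            return sorted(group)
    return [term]


print(expand("Rad"))
print(expand("Jacke"))
```

Treating the pair as an equivalence means a query for Rad also finds Fahrrad documents, and vice versa, without ever producing the poorly matching term fahren.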


18 matches
