Re: How to use the lemmatizer
Hi, thanks for the feedback. I'm really only interested in getting convenient access to the dictionaries so that I can use them for lemmatization. For this particular task, I'm not using any other functionality from LanguageTool, including the grammar rules. So here is what I do:

- run a probabilistic POS tagger
- feed the tokens of my text to the LanguageTool tagger to get all dictionary entries
- find a match between the POS tag produced by the probabilistic tagger and the returned dictionary entries
- if there is a match, use the respective lemma

Matching is the most annoying part, because the tagset used by the probabilistic tagger may not be the same as the one used in the LanguageTool dictionary. So I now try three matching approaches:

- checking whether the POS tag from the tagger and the one from the dictionary are exactly the same
- checking whether the POS tag from the tagger is the same as the first element of the dictionary tag (splitting by ':')
- using mapping tables to map both the tag from the POS tagger and the tag from the dictionary to a coarse-grained scheme of word classes, and checking whether they match there

Seems to work quite OK.

Cheers,
-- Richard

On 27.01.2014, at 22:38, Marcin Miłkowski list-addr...@wp.pl wrote:

> Beware that statistical POS taggers will necessarily obfuscate
> non-grammatical material, as they try to guess the correct tags. This
> makes them quite useless for writing rules. We've been there, tried
> that. I haven't yet found a decent English POS tagger, for example,
> that would be useful.
>
> Note however that if you have frequency info, you can add it to your
> tagger dictionary. And we can indeed do so using typing frequency
> lists, so you'd be able to assign the most frequent lemma if you need,
> I guess. The procedure is described here:
> http://wiki.languagetool.org/hunspell-support
> See under "Including frequency data".

___
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
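The three matching strategies described above can be sketched in plain Java. The tag names and mapping tables below are hypothetical stand-ins; the real ones depend on which tagset the statistical tagger emits and which one the LanguageTool dictionary uses:

```java
import java.util.Map;

public class TagMatcher {
    // Hypothetical coarse-grained mapping tables; real tables must be built
    // for the concrete tagsets involved.
    static final Map<String, String> TAGGER_TO_COARSE = Map.of(
            "NN", "NOUN", "NNS", "NOUN", "VB", "VERB", "VBD", "VERB");
    static final Map<String, String> DICT_TO_COARSE = Map.of(
            "NN", "NOUN", "NNS", "NOUN", "VB", "VERB", "VBD", "VERB");

    // Strategy 1: exact string equality of the two tags.
    static boolean exactMatch(String taggerTag, String dictTag) {
        return taggerTag.equals(dictTag);
    }

    // Strategy 2: compare the tagger tag against the first ':'-separated
    // element of the dictionary tag.
    static boolean prefixMatch(String taggerTag, String dictTag) {
        return taggerTag.equals(dictTag.split(":")[0]);
    }

    // Strategy 3: map both tags into a coarse word-class scheme and compare
    // there; unmapped tags never match.
    static boolean coarseMatch(String taggerTag, String dictTag) {
        String a = TAGGER_TO_COARSE.get(taggerTag);
        String b = DICT_TO_COARSE.get(dictTag.split(":")[0]);
        return a != null && a.equals(b);
    }
}
```

The strategies are ordered from strictest to loosest, so a natural policy is to try them in that order and accept the first one that matches.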
Re: How to use the lemmatizer
On 2014-01-27 15:44, Richard Eckart de Castilho wrote:

> I would like a method that accepts an inflected form and a POS tag and
> that returns a single lemma.

A language-specific list of POS tags is available in the synthesizer (if the language has one) as BaseSynthesizer.possibleTags (it's not public, only protected). You can use your own tagger by overriding getTagger() in Language.java, but its tags would need to be the same as the original ones, otherwise the rules wouldn't match anymore.

To answer the original question: I don't know of a cleaner way to look up single words either.

Regards
Daniel
How to use the lemmatizer
Hello everybody,

I may be totally wrong, but I believe the lemmatizers in LanguageTool are implemented based on dictionaries. I suppose a dictionary entry would be made up of a form, a lemma, and a POS tag. Assuming this is correct, is there a simple way to do a lookup in such a dictionary? Also, is there a way to find out which tagsets are used by these dictionaries (or maybe there is even some standard in LanguageTool, e.g. verbs are always V and nouns are always N, or something like that)?

I would like a method that accepts an inflected form and a POS tag and that returns a single lemma. Currently I am doing this, but it seems a bit awkward:

    List<AnalyzedTokenReadings> rawTaggedTokens = lang.getTagger().tag(tokenText);
    AnalyzedSentence as = new AnalyzedSentence(
        rawTaggedTokens.toArray(new AnalyzedTokenReadings[rawTaggedTokens.size()]));
    as = lang.getDisambiguator().disambiguate(as);
    String best = getMostFrequentLemma(as.getTokens()[i]);

In particular, I would like to use a different POS tagger. I have various statistical POS taggers at my disposal that produce a single POS tag per token - and that is what I want. The LanguageTool POS tagger produces multiple unranked POS tags per token.

Cheers,
-- Richard
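The lookup being asked for, (inflected form, POS tag) → single lemma, can be sketched without LanguageTool against a toy in-memory dictionary. The entries and tag names below are made up for illustration; in LanguageTool the readings would come from the language's tagger dictionary:

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class Lemmatizer {
    record Entry(String lemma, String posTag) {}

    // Toy stand-in for the dictionary: each surface form maps to one or more
    // (lemma, POS tag) readings.
    static final Map<String, List<Entry>> DICT = Map.of(
            "walks", List.of(new Entry("walk", "VBZ"), new Entry("walk", "NNS")),
            "left", List.of(new Entry("leave", "VBD"), new Entry("left", "JJ")));

    // Return the lemma of the first reading whose POS tag matches, if any.
    static Optional<String> lemma(String form, String posTag) {
        return DICT.getOrDefault(form, List.of()).stream()
                .filter(e -> e.posTag().equals(posTag))
                .map(Entry::lemma)
                .findFirst();
    }
}
```

The ambiguous form "left" illustrates why the POS tag is needed at all: the lemma is "leave" for the verb reading but "left" for the adjective reading.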
Re: How to use the lemmatizer
Hello,

On 2014-01-27 15:44, Richard Eckart de Castilho wrote:

> In particular, I would like to use a different POS tagger. I have
> various statistical POS taggers at my disposal that produce a single
> POS per token - and that is what I want. The LanguageTool POS tagger
> produces multiple unranked POS tags per token.

Beware that statistical POS taggers will necessarily obfuscate non-grammatical material, as they try to guess the correct tags. This makes them quite useless for writing rules. We've been there, tried that. I haven't yet found a decent English POS tagger, for example, that would be useful.

Note however that if you have frequency info, you can add it to your tagger dictionary. And we can indeed do so using typing frequency lists, so you'd be able to assign the most frequent lemma if you need, I guess. The procedure is described here:

http://wiki.languagetool.org/hunspell-support

See under "Including frequency data".

Regards,
Marcin
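Choosing the most frequent lemma, as suggested above, amounts to ranking a token's candidate lemmas by a frequency table. The numbers below are made up for illustration; real values would come from a corpus or from frequency data added to the tagger dictionary as described on the hunspell-support wiki page:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class FrequentLemma {
    // Hypothetical lemma frequencies; lemmas missing from the table count as 0.
    static final Map<String, Integer> FREQ = Map.of(
            "walk", 120, "walking", 30, "leave", 80, "left", 15);

    // Among a token's candidate lemmas, pick the one with the highest frequency.
    static String mostFrequent(List<String> lemmas) {
        return lemmas.stream()
                .max(Comparator.comparingInt(l -> FREQ.getOrDefault(l, 0)))
                .orElseThrow();
    }
}
```

This is the shape of the `getMostFrequentLemma` helper mentioned in the original post, just with the frequency source made explicit.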