Re: How to use the lemmatizer

2014-01-29 Thread Richard Eckart de Castilho
Hi,

thanks for the feedback. I'm really only interested in getting convenient 
access to the dictionaries, so that I can use them for lemmatization. For this 
particular task, I'm not using any other functionality from LanguageTool, 
including grammatical rules.

So here is what I do:

- run a probabilistic POS tagger
- feed the tokens of my text to the LanguageTool tagger to get all dictionary 
entries
- find a match between the POS tag created by the probabilistic tagger and the 
returned dictionary entries
- if there is a match, use the respective lemma

Matching is the most annoying part, because the tagset used by the 
probabilistic tagger may not be the same as the one used in the LanguageTool 
dictionary. So now I try three matching approaches:

- checking if the POS tag from the tagger and the one from the dictionary are 
exactly the same
- checking if the POS tag from the tagger is the same as the first element of 
the dictionary tag (splitting by ':')
- using mapping tables to map both the tag from the POS tagger and the tag 
from the dictionary to a coarse-grained scheme of word classes, and checking 
whether they match there
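
The three matching approaches could be sketched roughly as follows. This is only an illustration under assumptions: the class LemmaMatcher, the Entry type, the mapping tables and the example tags are all hypothetical and not part of LanguageTool, and the approaches are tried in the order listed, which the description above does not state explicitly. The coarse mapping of the dictionary tag is done on its first element, which is also an assumption.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class LemmaMatcher {

    /** One dictionary reading of a token: a lemma and the dictionary's POS tag. */
    static final class Entry {
        final String lemma;
        final String posTag;

        Entry(String lemma, String posTag) {
            this.lemma = lemma;
            this.posTag = posTag;
        }
    }

    // Hypothetical mapping tables; real ones depend on the tagsets in use.
    static final Map<String, String> TAGGER_TO_COARSE =
        Map.of("NN", "NOUN", "NNS", "NOUN", "VBD", "VERB", "VBZ", "VERB");
    static final Map<String, String> DICTIONARY_TO_COARSE =
        Map.of("SUB", "NOUN", "VER", "VERB");

    /** Returns the part of a dictionary tag before the first ':' (or the whole tag). */
    static String firstElement(String tag) {
        int colon = tag.indexOf(':');
        return colon < 0 ? tag : tag.substring(0, colon);
    }

    /** Tries exact match, then first-element match, then coarse word-class match. */
    static Optional<String> findLemma(String taggerTag, List<Entry> entries) {
        // 1. The tagger tag and the dictionary tag are exactly the same.
        for (Entry e : entries) {
            if (e.posTag.equals(taggerTag)) {
                return Optional.of(e.lemma);
            }
        }
        // 2. The tagger tag equals the first element of the dictionary tag.
        for (Entry e : entries) {
            if (firstElement(e.posTag).equals(taggerTag)) {
                return Optional.of(e.lemma);
            }
        }
        // 3. Both tags map to the same coarse-grained word class.
        String coarse = TAGGER_TO_COARSE.get(taggerTag);
        if (coarse != null) {
            for (Entry e : entries) {
                if (coarse.equals(DICTIONARY_TO_COARSE.get(firstElement(e.posTag)))) {
                    return Optional.of(e.lemma);
                }
            }
        }
        return Optional.empty(); // no match, so no lemma is assigned
    }
}
```

With entries built from the dictionary readings of one token, a call such as LemmaMatcher.findLemma("NN", entries) returns the lemma of the first reading that matches, or an empty Optional if none of the three approaches finds a match.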

Seems to work quite ok.

Cheers,

-- Richard

On 27.01.2014, at 22:38, Marcin Miłkowski list-addr...@wp.pl wrote:

 Hello,
 
 On 2014-01-27 15:44, Richard Eckart de Castilho wrote:
 Hello everybody,
 
 I may be totally wrong, but I believe the lemmatizers in LanguageTool are 
 implemented based on dictionaries. I suppose a dictionary entry would be 
 made up of a form, a lemma, and a pos tag.
 
 Assuming this is correct, is there a simple way to do a lookup in such a 
 dictionary?
 
 Also, is there a way to find out which tagsets are used by these 
 dictionaries (or maybe there is even some standard in LanguageTool, e.g. 
 verbs are always V and nouns are always N or something like that)?
 
 I would like a method that accepts an inflected form and a pos tag and that 
 returns a single lemma.
 
 
 Currently, I am doing this, but it seems a bit awkward.
 
 List<AnalyzedTokenReadings> rawTaggedTokens = lang.getTagger().tag(tokenText);
 AnalyzedSentence as = new AnalyzedSentence(
   rawTaggedTokens.toArray(new AnalyzedTokenReadings[rawTaggedTokens.size()]));
 as = lang.getDisambiguator().disambiguate(as);
 String best = getMostFrequentLemma(as.getTokens()[i]);
 
 In particular, I would like to use a different POS tagger. I have various 
 statistical POS taggers at my disposal that produce a single POS per token - 
 and that is what I want. The LanguageTool POS tagger produces multiple 
 unranked POS tags per token.
 
 Beware that statistical POS taggers will necessarily obfuscate 
 non-grammatical material, as they try to guess the correct tags. This 
 makes them quite useless for writing rules. We've been there, tried 
 that. I haven't yet found a decent English POS tagger, for example, that 
 would be useful.
 
 Note however that if you have frequency info, you can add it to your 
 tagger dictionary. And we indeed can do so using typing frequency lists, 
 so you'd be able to assign the most frequent lemma if you need, I guess. 
 The procedure is described here:
 
 http://wiki.languagetool.org/hunspell-support
 
 See the section on including frequency data.
 
 Regards,
 Marcin


___
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel


Re: How to use the lemmatizer

2014-01-27 Thread Daniel Naber
On 2014-01-27 15:44, Richard Eckart de Castilho wrote:

 I would like a method that accepts an inflected form and a pos tag and
 that returns a single lemma.

A language-specific list of POS tags is available in the synthesizer (if 
the language has one) as BaseSynthesizer.possibleTags (it's not public 
but protected).

You can use your own tagger by overriding getTagger() in 
Language.java, but its tags would need to be the same as the original 
ones, otherwise rules wouldn't match anymore.

To answer the original question, I don't know of a cleaner way to look up 
single words either.

Regards
  Daniel




How to use the lemmatizer

2014-01-27 Thread Richard Eckart de Castilho
Hello everybody,

I may be totally wrong, but I believe the lemmatizers in LanguageTool are 
implemented based on dictionaries. I suppose a dictionary entry would be made 
up of a form, a lemma, and a pos tag.

Assuming this is correct, is there a simple way to do a lookup in such a 
dictionary? 

Also, is there a way to find out which tagsets are used by these dictionaries 
(or maybe there is even some standard in LanguageTool, e.g. verbs are always V 
and nouns are always N or something like that)?

I would like a method that accepts an inflected form and a pos tag and that 
returns a single lemma.


Currently, I am doing this, but it seems a bit awkward.

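// Look up all dictionary readings (lemma + POS tag) for the tokens.
List<AnalyzedTokenReadings> rawTaggedTokens = lang.getTagger().tag(tokenText);
// Wrap the readings in a sentence so that the disambiguator can process them.
AnalyzedSentence as = new AnalyzedSentence(
  rawTaggedTokens.toArray(new AnalyzedTokenReadings[rawTaggedTokens.size()]));
as = lang.getDisambiguator().disambiguate(as);
// Pick a lemma for token i (getMostFrequentLemma is not shown in this message).
String best = getMostFrequentLemma(as.getTokens()[i]);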

In particular, I would like to use a different POS tagger. I have various 
statistical POS taggers at my disposal that produce a single POS per token - 
and that is what I want. The LanguageTool POS tagger produces multiple unranked 
POS tags per token.

Cheers,

-- Richard


Re: How to use the lemmatizer

2014-01-27 Thread Marcin Miłkowski
Hello,

On 2014-01-27 15:44, Richard Eckart de Castilho wrote:
 Hello everybody,

 I may be totally wrong, but I believe the lemmatizers in LanguageTool are 
 implemented based on dictionaries. I suppose a dictionary entry would be made 
 up of a form, a lemma, and a pos tag.

 Assuming this is correct, is there a simple way to do a lookup in such a 
 dictionary?

 Also, is there a way to find out which tagsets are used by these dictionaries 
 (or maybe there is even some standard in LanguageTool, e.g. verbs are always 
 V and nouns are always N or something like that)?

 I would like a method that accepts an inflected form and a pos tag and that 
 returns a single lemma.


 Currently, I am doing this, but it seems a bit awkward.

 List<AnalyzedTokenReadings> rawTaggedTokens = lang.getTagger().tag(tokenText);
 AnalyzedSentence as = new AnalyzedSentence(
   rawTaggedTokens.toArray(new AnalyzedTokenReadings[rawTaggedTokens.size()]));
 as = lang.getDisambiguator().disambiguate(as);
 String best = getMostFrequentLemma(as.getTokens()[i]);

 In particular, I would like to use a different POS tagger. I have various 
 statistical POS taggers at my disposal that produce a single POS per token - 
 and that is what I want. The LanguageTool POS tagger produces multiple 
 unranked POS tags per token.

Beware that statistical POS taggers will necessarily obfuscate 
non-grammatical material, as they try to guess the correct tags. This 
makes them quite useless for writing rules. We've been there, tried 
that. I haven't yet found a decent English POS tagger, for example, that 
would be useful.

Note however that if you have frequency info, you can add it to your 
tagger dictionary. And we indeed can do so using typing frequency lists, 
so you'd be able to assign the most frequent lemma if you need, I guess. 
The procedure is described here:

http://wiki.languagetool.org/hunspell-support

See the section on including frequency data.
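
As a rough illustration of assigning the most frequent lemma, here is a minimal sketch. It does not read frequency data from a LanguageTool dictionary; the class FrequencyLemmaPicker and the frequency map are hypothetical stand-ins for whatever source of counts is available, and ties are resolved in favour of the first candidate, which is an arbitrary choice.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class FrequencyLemmaPicker {

    /**
     * Returns the candidate lemma with the highest count in the frequency map.
     * Lemmas missing from the map count as 0; on a tie the first candidate wins.
     */
    static Optional<String> mostFrequentLemma(List<String> candidateLemmas,
                                              Map<String, Integer> frequency) {
        String best = null;
        int bestCount = -1;
        for (String lemma : candidateLemmas) {
            int count = frequency.getOrDefault(lemma, 0);
            if (count > bestCount) {
                best = lemma;
                bestCount = count;
            }
        }
        // Empty if there were no candidates at all.
        return Optional.ofNullable(best);
    }
}
```

The candidate lemmas would be the lemmas of all readings the tagger returns for one token.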

Regards,
Marcin
