On 01/24/2013 11:59 AM, Renzo wrote:
Hi all,
I'm pretty new to OpenNLP.
My main interest is in generating document summaries using
algorithms such as TextRank.
This task requires sentence and token splitting - here's where OpenNLP
enters the game.
I also need some degree of POS tagging to detect nouns, verbs and so
on, in order to add some linguistic support to the ranking process.
It was fairly surprising to discover that noun tags - for example -
are language dependent. Thus an "isNoun" predicate needs a specific
answer for each language. It's "NN" for English, but it may be
different for others.
I just wonder if there is a common (i.e. language-independent) way to
answer this kind of question.
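One possible workaround for the per-language tag sets is a small lookup
table keyed by language code. The sketch below is only an illustration:
the class NounTags and its method are hypothetical names, not part of
OpenNLP. It assumes the English model emits Penn Treebank tags (where
the noun tags are NN, NNS, NNP and NNPS) and that a German model emits
STTS tags (NN, NE); the real tag set depends on the corpus each model
was trained on and should be checked against the model itself.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of OpenNLP.
public class NounTags {

    // Language code -> tags treated as nouns for that language.
    private static final Map<String, Set<String>> NOUN_TAGS = new HashMap<>();

    static {
        // Assumption: the English model uses the Penn Treebank tag set.
        NOUN_TAGS.put("en", new HashSet<>(Arrays.asList("NN", "NNS", "NNP", "NNPS")));
        // Assumption: the German model uses the STTS tag set.
        NOUN_TAGS.put("de", new HashSet<>(Arrays.asList("NN", "NE")));
    }

    // Returns true if the tag is registered as a noun tag for the language.
    static boolean isNoun(String language, String tag) {
        Set<String> tags = NOUN_TAGS.get(language);
        if (tags == null) {
            throw new IllegalArgumentException("No tag table for language: " + language);
        }
        return tags.contains(tag);
    }

    public static void main(String[] args) {
        System.out.println(isNoun("en", "NNS"));
        System.out.println(isNoun("en", "VB"));
    }
}
```

The table has to be extended by hand for every language and tag set in
use, so it moves the language dependence into one place rather than
removing it.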
Furthermore, is the logical format of the available binary files
documented anywhere? Is there any way to browse those files to
inspect the tag list they use?
No, we did not write up a specification of our model formats. Though, you
can find lots of information about them in various places.
All the models are zip files, which contain simple artifacts such as XML
dictionaries and maxent models. You can find an explanation of the
maxent model format somewhere in the maxent project, but usually the
model is treated as a black box, because it can't really be modified
after training.
Let us know if you have more questions about the formats; it's probably
easier if we discuss them component by component,
depending on your needs.
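Since the models are ordinary zip files, you can list what is inside one
with the standard java.util.zip classes. Here is a minimal sketch of
that; the class name ModelArchiveLister is made up for this example and
is not part of OpenNLP, and the path of the model file is whatever you
pass on the command line. It only shows the artifact names inside the
archive, not the tags stored in the maxent model.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, not part of OpenNLP.
public class ModelArchiveLister {

    // Returns the names of all entries in the zip archive at the given path.
    static List<String> listEntries(String path) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipFile zip = new ZipFile(path)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                names.add(entries.nextElement().getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ModelArchiveLister <model file>");
            return;
        }
        // Print one artifact name per line.
        for (String name : listEntries(args[0])) {
            System.out.println(name);
        }
    }
}
```

Any zip tool works just as well for a quick look; the Java version is
only handy if you want to do the inspection from your own code.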
Tokenization, sentence splitting and POS tagging are usually easy to
get working well, especially when you do some training on your own data.
The existing models are mostly trained on news articles and might not
perform that well on other domains.
Jörn