There was once a project called MULTEXT:

  * MULTEXT (Multilingual Text Tools and Corpora) (1994) by Nancy Ide,
Jean Véronis, COLING'94
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.61.9202
  * project description
http://aune.lpl.univ-aix.fr/projects/multext/LEX/LEX2.html (sorry, the
description is in French). The idea was to define a single schema that
covers the specificities of all languages.
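
In the meantime, a language-independent "isNoun" can be approximated by mapping each tagset onto a few shared coarse categories. The following is only a minimal sketch: the tagset keys ("en-ptb", "de-stts"), the category names and the function names are made up for illustration, and the tables list just a handful of Penn Treebank and STTS tags, so they would need to be extended to match whatever tags your models actually emit.

```python
# Coarse categories shared across languages (illustrative names).
NOUN, VERB, OTHER = "NOUN", "VERB", "OTHER"

# Hypothetical per-tagset tables: only a few tags are listed here.
TAG_MAPS = {
    # Penn Treebank tags (English)
    "en-ptb": {
        "NN": NOUN, "NNS": NOUN, "NNP": NOUN, "NNPS": NOUN,
        "VB": VERB, "VBD": VERB, "VBG": VERB,
        "VBN": VERB, "VBP": VERB, "VBZ": VERB,
    },
    # STTS tags (German), full verbs only
    "de-stts": {
        "NN": NOUN, "NE": NOUN,
        "VVFIN": VERB, "VVINF": VERB, "VVPP": VERB,
    },
}


def coarse_tag(tagset, tag):
    """Map a tagset-specific tag to a coarse category, defaulting to OTHER."""
    return TAG_MAPS[tagset].get(tag, OTHER)


def is_noun(tagset, tag):
    """Language-independent noun test built on top of the per-tagset tables."""
    return coarse_tag(tagset, tag) == NOUN
```

With this, the ranking code only ever asks is_noun("en-ptb", tag) or is_noun("de-stts", tag), and the language-specific knowledge stays in the tables.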

On Thu, Jan 24, 2013 at 12:41 PM, Jörn Kottmann <[email protected]> wrote:
> On 01/24/2013 11:59 AM, Renzo wrote:
>>
>> Hi all,
>> I'm pretty new to OpenNLP.
>> My interest is mostly in producing document summaries using algorithms
>> such as TextRank.
>> This task requires sentence and token splitting - here's where OpenNLP
>> enters the game.
>> I also need some degree of POS to detect nouns, verbs and so on, in order
>> to add some linguistic support to the ranking process.
>>
>> It was fairly surprising to discover that noun tags - for example - are
>> language dependent. Thus an "isNoun" predicate needs a specific answer for
>> each language. It's "NN" for English, but it may be different for others.
>>
>> I just wonder if there is a common (i.e. language-independent) way to
>> answer this kind of question.
>>
>> Furthermore, is the logical format of the available binary files documented
>> anywhere? Is there any way to browse those files to inspect the tag list
>> they use?
>
>
>
> No, we did not write up a specification of our model formats. Though, you can
> find lots of information about it in various places.
> All the models are zip files, which contain simple artifacts, e.g. XML
> dictionaries and maxent models. You can find the
> format explanation for the maxent models somewhere in the maxent project, but
> usually that is used like a black box, because
> the model can't really be modified after training.
>
> Let us know if you have more questions about the formats; it's probably
> easier if we discuss it component by component,
> depending on your needs.
>
> Tokenization, sentence splitting and POS tagging are usually easy to get
> to perform nicely, especially when you do some training.
> The existing models are mostly trained on news articles and might not
> perform that well on other domains.
>
> Jörn
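
Since the models are zip files, as Jörn says, one way to start browsing them is to list what each archive contains. This is a minimal sketch using only Python's standard zipfile module. The function name is made up, and the entry names you will see depend on the model and on the OpenNLP version, so none are assumed here.

```python
import zipfile


def list_model_entries(model_path):
    """Return the names of the files packed inside a zip-based model archive."""
    with zipfile.ZipFile(model_path) as archive:
        return archive.namelist()


# Example call (the path is a placeholder for a model file you downloaded):
# for name in list_model_entries("path/to/model.bin"):
#     print(name)
```

This only shows the packaged artifacts. Reading the tag list out of the maxent model itself is a separate step that depends on that model's format.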



-- 
Dr. Nicolas Hernandez
Associate Professor (Maître de Conférences)
Université de Nantes - LINA CNRS UMR 6241
http://enicolashernandez.blogspot.com
http://www.univ-nantes.fr/hernandez-n
+33 (0)2 51 12 53 94
+33 (0)2 40 30 60 67
