Hello Jörn,

Thank you for the quick response!
The data from the corpus I'm using already came with the punctuation
removed. I'll see what I can do about it.

And yes, what I'm ultimately planning to do is to train POS models for Zulu
and other related languages, and hopefully make them available to the community.

Mariya

On 19 June 2012 16:20, Jörn Kottmann <[email protected]> wrote:

> BTW, the POS data can easily be used to train an OpenNLP POS model.
>
> Jörn
>
>
> On 06/19/2012 04:17 PM, Jörn Kottmann wrote:
>
>> Hello,
>>
>> The sentence detector performs end-of-sentence character
>> disambiguation. In your case, every end-of-sentence character
>> is a genuine sentence end.
>>
>> So it only sees one outcome in your entire corpus. To train
>> a sentence detector model you need both cases, so it can learn
>> which are valid sentence ends, and which are not.
>>
>> The training fails an internal validation check, which should
>> report a nicer error message.
>>
>> I suggest not removing the punctuation from your training sentences;
>> then it should work.
>>
>> HTH,
>> Jörn
>>
>> On 06/19/2012 04:03 PM, Mariya Koleva wrote:
>>
>>> Hi,
>>> I apologise if the question is trivial, but I'm not experienced with
>>> OpenNLP (and not too confident in my Java skills either).
>>>
>>> I'm trying to train a sentence detection model for Zulu. No matter
>>> whether I'm using the command line interface or the API, it appears to
>>> be training, but a model file is not created. I'm getting the following
>>> exception [1]:
>>> java.lang.IllegalArgumentException: The maxent model is not compatible
>>> with the sentence detector!
>>>
>>> The original data comes from the Ukwabelana corpus [2] in a text file
>>> (US-ASCII), one sentence per line. It is completely stripped of
>>> capitalisation and any kind of punctuation. I automatically added a "."
>>> at the end of every sentence, so that there is some EOS token for the
>>> program to pick up.
>>>
>>> I would appreciate any insight as to what is to be done!
>>>
>>> Mariya
>>>
>>> [1] The whole output is:
>>>
>>> Indexing events using cutoff of 5
>>>
>>>     Computing event counts… done. 29424 events
>>>     Indexing… done.
>>>     Sorting and merging events… done. Reduced 29424 events to 7830.
>>>     Done indexing.
>>>     Incorporating indexed data for training…
>>>     done.
>>>
>>>     Number of Event Tokens: 7830
>>>     Number of Outcomes: 1
>>>     Number of Predicates: 1673
>>>
>>>     …done.
>>>
>>>     Computing model parameters …
>>>     Performing 100 iterations.
>>>     1: … loglikelihood=0.0 1.0
>>>     2: … loglikelihood=0.0 1.0
>>>
>>>     Exception in thread "main" java.lang.IllegalArgumentException: The maxent model is not compatible with the sentence detector!
>>>     at opennlp.tools.util.model.BaseModel.checkArtifactMap(BaseModel.java:275)
>>>     at opennlp.tools.sentdetect.SentenceModel.<init>(SentenceModel.java:64)
>>>     at opennlp.tools.sentdetect.SentenceDetectorME.train(SentenceDetectorME.java:285)
>>>     at opennlp.tools.sentdetect.SentenceDetectorME.train(SentenceDetectorME.java:296)
>>>     at opennlp.tools.cmdline.sentdetect.SentenceDetectorTrainerTool.run(SentenceDetectorTrainerTool.java:111)
>>>     at opennlp.tools.cmdline.CLI.main(CLI.java:191)
>>>
>>>
>>> [2] http://www.cs.bris.ac.uk/Research/MachineLearning/Morphology/resources.jsp#corpus
>>>
>>>
>>
>
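
The diagnosis in the quoted reply ("Number of Outcomes: 1") can be checked
before training. Below is a minimal Java sketch that is not part of OpenNLP:
it counts how many end-of-sentence characters in one-sentence-per-line text
sit at the end of a line and how many sit inside a line. The class name
EosOutcomeCheck, the method countEos, the character set ".!?" and the sample
sentences are all assumptions made for illustration. If the second count is
zero, the trainer will most likely see only one outcome, as in the log above.

```java
import java.util.List;

final class EosOutcomeCheck {
    // Assumed end-of-sentence characters; adjust them for the language at hand.
    static final String EOS_CHARS = ".!?";

    /**
     * Counts end-of-sentence characters in one-sentence-per-line text.
     * Returns {count at line end, count inside a line}.
     */
    static int[] countEos(List<String> sentences) {
        int atEnd = 0;
        int inside = 0;
        for (String raw : sentences) {
            String line = raw.trim();
            for (int i = 0; i < line.length(); i++) {
                if (EOS_CHARS.indexOf(line.charAt(i)) < 0) {
                    continue; // not a candidate character
                }
                if (i == line.length() - 1) {
                    atEnd++;  // candidate that really ends the sentence
                } else {
                    inside++; // candidate that does not end the sentence
                }
            }
        }
        return new int[] {atEnd, inside};
    }

    public static void main(String[] args) {
        // Hypothetical data: punctuation stripped, "." appended to every line.
        List<String> stripped = List.of("first sentence .", "second sentence .");
        // Hypothetical data: punctuation left in place.
        List<String> natural = List.of("Dr. Smith arrived at 3.30 p.m.", "Really?");

        int[] a = countEos(stripped);
        int[] b = countEos(natural);
        System.out.println("stripped: end=" + a[0] + " inside=" + a[1]);
        System.out.println("natural:  end=" + b[0] + " inside=" + b[1]);
    }
}
```

Only the second kind of data contains both cases, which is what the
suggestion to keep the original punctuation amounts to.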
