Hello,

OK, I checked the sources now and I can see that the tokenizer skips
further tokenisation if a certain pattern is matched. I also looked into
the default values of the patterns. As far as I can see, there is no
pattern for the "de" language flag within Factory.java. For me this
means that there is no standard way of training a German model with the
TokenizerTrainer tool. I guess I have to write my own training tool
where I set the pattern for TokenizerME myself. Am I right here?
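
Something like the sketch below is what I have in mind. It is untested
and based only on my reading of the sources: I am assuming the
TokenizerFactory constructor that takes a custom alphanumeric Pattern,
and the file names plus the umlaut pattern are placeholders for my own
data.

import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.util.regex.Pattern;

import opennlp.tools.tokenize.TokenSample;
import opennlp.tools.tokenize.TokenSampleStream;
import opennlp.tools.tokenize.TokenizerFactory;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class GermanTokenizerTrainer {

    public static void main(String[] args) throws IOException {
        // Alphanumeric pattern extended with German umlauts and sharp s,
        // so that plain German words are matched (and skipped) instead of
        // being handed to the model by the default [A-Za-z0-9] pattern.
        Pattern alphaNumeric = Pattern.compile("^[A-Za-z0-9äöüÄÖÜß]+$");

        // Training data: one sentence per line, annotated with <SPLIT> tags.
        ObjectStream<String> lines = new PlainTextByLineStream(
                new InputStreamReader(
                        new FileInputStream("de-token.train"), "UTF-8"));
        ObjectStream<TokenSample> samples = new TokenSampleStream(lines);

        // null = no abbreviation dictionary,
        // true = use the alphanumeric optimization with the custom pattern
        TokenizerFactory factory =
                new TokenizerFactory("de", null, true, alphaNumeric);

        TokenizerModel model = TokenizerME.train(
                samples, factory, TrainingParameters.defaultParams());
        samples.close();

        OutputStream out = new BufferedOutputStream(
                new FileOutputStream("de-token.bin"));
        model.serialize(out);
        out.close();
    }
}

The pattern with the umlauts is only a first guess; I would refine it
against my corpus.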

Finally, I wonder if the training class for the de-token.bin file on the
models page is public, so that I can adapt it for my own data. If anyone
can point me to it, that would be very helpful.

Thank you

Andreas

On 13.03.2013 12:15, James Kosin wrote:
> Andreas,
> 
> Tokenizing is a very simple procedure, so the default of 100 iterations
> should suffice as long as you have a large training set, say greater
> than about 1,000 lines.
> 
> James
> 
> On 3/13/2013 4:39 AM, Andreas Niekler wrote:
>> Hello,
>>
>> It was a clean set which I just annotated with the <SPLIT> tags.
>>
>> And the German root bases are not right in the examples I posted.
>>
>> I used 500 iterations; could it be an overfitting problem?
>>
>> Thanks for your help.
>>
>> On 13.03.2013 02:38, James Kosin wrote:
>>> On 3/12/2013 10:22 AM, Andreas Niekler wrote:
>>>> stehenge - blieben
>>>> fre - undlicher
>>> Andreas,
>>>
>>> I'm not an expert on German, but in English the models are also trained
>>> on splitting contractions and other words into their root bases.
>>>
>>> i.e.: You'll -split-> You 'll -meaning-> You will
>>>       Can't  -split-> Can 't  -meaning-> Can not
>>>
>>> Other words may also get parsed and separated by the tokenizer.
>>>
>>> Did you create the training data yourself?  Or was this a clean set of
>>> data from another source?
>>>
>>> James
>>>
> 

-- 
Andreas Niekler, Dipl. Ing. (FH)
NLP Group | Department of Computer Science
University of Leipzig
Johannisgasse 26 | 04103 Leipzig

mail: [email protected]
