Hello, it was a clean set that I just annotated with the <SPLIT> tags.
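For example, a line of the annotated data looks like this (a made-up sentence, assuming the usual OpenNLP TokenSample convention where <SPLIT> marks a token boundary that has no whitespace in the raw text):

    Wir sind stehengeblieben<SPLIT>, weil der Tag freundlicher wurde<SPLIT>.

Note that stehengeblieben and freundlicher carry no <SPLIT> tag themselves; only the punctuation is split off.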
And the German root bases are not right for the examples I posted. I used 500 iterations; could it be an overfitting problem? (A sketch of how I would retrain with fewer iterations is at the end of this mail.) Thanks for your help.

On 13.03.2013 02:38, James Kosin wrote:
> On 3/12/2013 10:22 AM, Andreas Niekler wrote:
>> stehenge - blieben
>> fre - undlicher
> Andreas,
>
> I'm not an expert on German, but in English the models are also trained
> on splitting contractions and other words into their root bases, i.e.:
>
>   You'll -split-> You 'll -meaning-> You will
>   Can't  -split-> Can 't  -meaning-> Can not
>
> Other words may also get parsed and separated by the tokenizer.
>
> Did you create the training data yourself? Or was this a clean set of
> data from another source?
>
> James

--
Andreas Niekler, Dipl. Ing. (FH)
NLP Group | Department of Computer Science
University of Leipzig
Johannisgasse 26 | 04103 Leipzig
mail: [email protected]
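For reference, here is a minimal sketch of retraining with a lower iteration count (assuming the OpenNLP 1.5 API; train.split and de-token.bin are made-up file names):

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.InputStreamReader;
    import java.io.OutputStream;

    import opennlp.tools.tokenize.TokenSample;
    import opennlp.tools.tokenize.TokenSampleStream;
    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;
    import opennlp.tools.util.TrainingParameters;

    public class TrainDeTokenizer {
        public static void main(String[] args) throws Exception {
            // One <SPLIT>-annotated sentence per line, UTF-8.
            ObjectStream<String> lines = new PlainTextByLineStream(
                    new InputStreamReader(new FileInputStream("train.split"), "UTF-8"));
            ObjectStream<TokenSample> samples = new TokenSampleStream(lines);

            // Fewer iterations and a higher feature cutoff make the model
            // generalize more; try 100/5 instead of 500 iterations.
            TrainingParameters params = new TrainingParameters();
            params.put(TrainingParameters.ITERATIONS_PARAM, Integer.toString(100));
            params.put(TrainingParameters.CUTOFF_PARAM, Integer.toString(5));

            TokenizerModel model = TokenizerME.train("de", samples, true, params);

            OutputStream out = new FileOutputStream("de-token.bin");
            model.serialize(out);
            out.close();
            samples.close();
        }
    }

If the odd splits persist even with far fewer iterations, overfitting is probably not the cause and the training data itself would be the next thing to check.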
