Re: TokenizerTrainer

James Kosin Wed, 13 Mar 2013 04:16:29 -0700

Andreas,

Tokenizing is a very simple procedure, so the default of 100 iterations should suffice as long as you have a large training set, say more than about 1,000 lines.


James

On 3/13/2013 4:39 AM, Andreas Niekler wrote:

Hello,

It was a clean set, which I just annotated with the <SPLIT> tags.

And the German root bases for those examples are not right in the
cases I posted.
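
For readers unfamiliar with the <SPLIT> annotation: in OpenNLP tokenizer training data, each line holds one sentence, and token boundaries are marked either by whitespace or by a <SPLIT> tag where two tokens touch. The following is a minimal Python sketch, not part of OpenNLP, for checking what tokens an annotated line implies. The function name and the sample sentence are made up for illustration.

```python
SPLIT_TAG = "<SPLIT>"

def tokens_from_annotated_line(line):
    """Return the tokens implied by one annotated training line.

    Tokens are separated by whitespace or by the <SPLIT> tag.
    (Hypothetical helper for inspecting data; not an OpenNLP API.)
    """
    tokens = []
    for chunk in line.split():
        # A chunk such as "stehengeblieben<SPLIT>." holds two tokens.
        tokens.extend(part for part in chunk.split(SPLIT_TAG) if part)
    return tokens

# Illustrative sentence: the only split is before the final period,
# so "stehengeblieben" should stay one token.
sample = "Er ist stehengeblieben<SPLIT>."
print(tokens_from_annotated_line(sample))
```

Running a check like this over the training file can help confirm that no <SPLIT> tag ended up inside a word by accident before looking at the iteration count.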

I used 500 iterations. Could it be an overfitting problem?

Thanks for your help.

On 13.03.2013 02:38, James Kosin wrote:

On 3/12/2013 10:22 AM, Andreas Niekler wrote:

stehenge - blieben
fre - undlicher

Andreas,

I'm not an expert on German, but in English the models are also trained
on splitting contractions and other words into their root bases.

i.e.:  You'll -split-> You 'll -meaning-> You will
       Can't -split-> Can 't -meaning-> Can not

Other words may also get parsed and separated by the tokenizer.

Did you create the training data yourself?  Or was this a clean set of
data from another source?

James
