On 03/15/2013 02:42 AM, James Kosin wrote:

Here, each token is separated by a space in the final output. What you seem to have is data that is already tokenized, and you are trying to generate a training file from that data. It isn't impossible, but nothing you do can perfectly recover the original text without the original data.

There are some rules that do work, but... not always.

We have historically always done it that way, because all the corpora we trained on contain only tokenized text, which therefore needs to be detokenized somehow to produce training data for the tokenizer.
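To make the idea concrete, here is a minimal sketch of the kind of rule-based detokenization being discussed. It is not the OpenNLP detokenizer API, just a hypothetical SimpleDetokenizer illustrating rules that mostly work (attach punctuation to the left, opening brackets to the right) and where they break down (quotes are ambiguous without the original text):

import java.util.Arrays;
import java.util.List;

public class SimpleDetokenizer {

    // Tokens that attach to the preceding token (no space before them).
    private static final List<String> ATTACH_LEFT =
            Arrays.asList(".", ",", ";", ":", "!", "?", ")", "]", "'s", "n't");

    // Tokens that attach to the following token (no space after them).
    private static final List<String> ATTACH_RIGHT =
            Arrays.asList("(", "[");

    public static String detokenize(String[] tokens) {
        StringBuilder text = new StringBuilder();
        boolean suppressSpace = true; // no space before the first token
        for (String token : tokens) {
            if (!suppressSpace && !ATTACH_LEFT.contains(token)) {
                text.append(' ');
            }
            text.append(token);
            suppressSpace = ATTACH_RIGHT.contains(token);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Quote characters are ambiguous without the original text,
        // which is exactly where rules like these fail.
        String[] tokens = {"He", "did", "n't", "say", "(", "loudly", ")", "\"", "hello", "\"", "."};
        System.out.println(detokenize(tokens));
    }
}

The recovered text is then usable as tokenizer training data, with the understanding that it only approximates the original surface form.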

Jörn
