On 03/15/2013 02:42 AM, James Kosin wrote:

Here, each token is separated by a space in the final output. What you seem to have is data that is already tokenized, and you are trying to generate a training file from that data. It isn't impossible, but nothing you do can perfectly recover the original text without the original data.

There are some rules that do work, but... not always.

We have historically always done it that way, because all the corpora we trained on contain only tokenized text, which therefore needs to be detokenized somehow to produce training data for the tokenizer.
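To make the idea concrete, here is a minimal sketch of the kind of rule-based detokenization being discussed. It is not the OpenNLP detokenizer API, just a hypothetical SimpleDetokenizer illustrating rules that mostly work (attach punctuation to the left, opening brackets to the right) and where they break down (quotes are ambiguous without the original text):

import java.util.Arrays;
import java.util.List;

public class SimpleDetokenizer {

    // Tokens that attach to the preceding token (no space before them).
    private static final List<String> ATTACH_LEFT =
            Arrays.asList(".", ",", ";", ":", "!", "?", ")", "]", "'s", "n't");

    // Tokens that attach to the following token (no space after them).
    private static final List<String> ATTACH_RIGHT =
            Arrays.asList("(", "[");

    public static String detokenize(String[] tokens) {
        StringBuilder text = new StringBuilder();
        boolean suppressSpace = true; // no space before the first token
        for (String token : tokens) {
            if (!suppressSpace && !ATTACH_LEFT.contains(token)) {
                text.append(' ');
            }
            text.append(token);
            suppressSpace = ATTACH_RIGHT.contains(token);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Quote characters are ambiguous without the original text,
        // which is exactly where rules like these fail.
        String[] tokens = {"He", "did", "n't", "say", "(", "loudly", ")", "\"", "hello", "\"", "."};
        System.out.println(detokenize(tokens));
    }
}

The recovered text is then usable as tokenizer training data, with the understanding that it only approximates the original surface form.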

Jörn
