On 03/14/2013 09:59 AM, Andreas Niekler wrote:
Hello,

I just added the <SPLIT> tag because files separated only by whitespace
could not be processed by the command line tool. It found just 1 feature,
and the training ended with an exception like "Unable to create model due
to" in the first iteration, with all the likelihoods at 1.0. I just
replaced all whitespace with the split tag as described in the
documentation.


If you want to tokenize based on whitespace, I suggest using our whitespace tokenizer.

In the training data, the <SPLIT> tag is normally used only inside whitespace-separated strings that contain more than one token. For example, "... said: ..." becomes "... said<SPLIT>: ..." in the training data. You usually need to produce this data manually, or you take a corpus that already contains tokenized text and use the de-tokenizer to produce the training data.
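
To illustrate the idea, here is a minimal sketch in plain Java of how a training line could be produced from already tokenized text. It is not the OpenNLP de-tokenizer; the class name SplitTagSketch, the method toTrainingLine and the ATTACH_LEFT punctuation set are hypothetical and only assume that these punctuation tokens attach to the preceding token without whitespace.

```java
import java.util.Set;

// Hypothetical sketch, not part of OpenNLP: builds a tokenizer training line
// from tokens that are already separated.
class SplitTagSketch {
    // Assumption: these tokens are written directly after the previous token.
    private static final Set<String> ATTACH_LEFT = Set.of(".", ",", ":", ";", "!", "?");

    static String toTrainingLine(String[] tokens) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                // <SPLIT> marks a token boundary with no whitespace in the raw text;
                // every other boundary is an ordinary space.
                sb.append(ATTACH_LEFT.contains(tokens[i]) ? "<SPLIT>" : " ");
            }
            sb.append(tokens[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] tokens = {"He", "said", ":", "hello", "."};
        System.out.println(toTrainingLine(tokens));
    }
}
```

Note that ordinary spaces stay as spaces; only the boundaries inside a whitespace-separated string get the tag, so the trainer sees both kinds of boundary.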

Another option is to use a Penn Treebank tokenizer. The cTAKES people are doing that and have one for UIMA;
they might contribute it to OpenNLP one day.

Jörn
