On 03/14/2013 09:59 AM, Andreas Niekler wrote:
Hello,
I just added the <SPLIT> tag because files with only whitespace (and no
<SPLIT> tags) couldn't be processed by the command line tool. It found
just 1 feature, the training ended with an exception like "Unable to
create model due to" in the first iteration, and all the likelihoods
were 1.0. I just replaced all whitespace with the split tag as described
in the documentation.
If you want to tokenize based on whitespace, I suggest using our
whitespace tokenizer.
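For reference, a minimal sketch of that (the input sentence is just made up):

  import opennlp.tools.tokenize.WhitespaceTokenizer;

  public class WhitespaceTokenizeExample {
    public static void main(String[] args) {
      // Splits purely on whitespace, no model or training data needed
      String[] tokens =
          WhitespaceTokenizer.INSTANCE.tokenize("He said: it works.");
      for (String t : tokens) {
        System.out.println(t);   // He / said: / it / works.
      }
    }
  }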
In the training data the <SPLIT> tag is usually only used in
whitespace-separated strings that contain more than one token,
e.g. "... said: ..." appears in the training data as
"... said<SPLIT>: ...".
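To illustrate, the training format is one sentence per line, and <SPLIT> only
marks the additional token boundaries inside such strings; a couple of
made-up lines:

  He said<SPLIT>: it works<SPLIT>.
  The deal closed on Friday<SPLIT>, he added<SPLIT>.

Plain whitespace between tokens needs no tag at all.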
You usually need to produce this data manually, or you take a corpus
which already contains tokenized text and use the de-tokenizer to
produce the training data.
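As a rough sketch of the de-tokenizer route (assuming a detokenizer dictionary
such as the latin-detokenizer.xml shipped with the OpenNLP sources, and
already-tokenized input; the file name and sentence are placeholders):

  import java.io.FileInputStream;
  import opennlp.tools.tokenize.DetokenizationDictionary;
  import opennlp.tools.tokenize.DictionaryDetokenizer;

  public class DetokenizeExample {
    public static void main(String[] args) throws Exception {
      // Load the detokenizer rules
      DetokenizationDictionary dict = new DetokenizationDictionary(
          new FileInputStream("latin-detokenizer.xml"));
      DictionaryDetokenizer detokenizer = new DictionaryDetokenizer(dict);

      // One already-tokenized sentence from the corpus
      String[] tokens = { "He", "said", ":", "it", "works", "." };

      // Re-attach the tokens and mark the merge points with <SPLIT>,
      // which gives a line in the tokenizer training format
      String trainingLine = detokenizer.detokenize(tokens, "<SPLIT>");
      System.out.println(trainingLine);
    }
  }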
Another option is to use a Penn Treebank tokenizer; the cTAKES people
are doing that and have one for UIMA, which they might contribute to
OpenNLP one day.
Jörn