Re: TokenizerTrainer

Jörn Kottmann Thu, 14 Mar 2013 02:15:52 -0700

On 03/14/2013 09:59 AM, Andreas Niekler wrote:

Hello,


I just added the <SPLIT> tag because whitespace-only files could not be
processed by the command line tool. It found just 1 feature, and the
training ended with an exception like "Unable to create model due
to" in the first iteration, with all the likelihoods at 1.0. I simply
replaced all whitespace with the split tag as described in the
documentation.

If you want to tokenize based on whitespace, I suggest using our whitespace tokenizer.

In the training data the <SPLIT> tag is usually only used for whitespace-separated strings where more than one token occurs in one string, e.g. "... said: ...", which appears in the training data as "... said<SPLIT>: ...". You usually need to produce this data manually, or you take some corpus which already contains tokenized text and use the de-tokenizer to produce the training data.
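
To illustrate the format, here is a rough sketch of how already-tokenized text could be turned into a training line. It is not the OpenNLP de-tokenizer, only a simplified stand-in written in Python; the function name and the two rule sets are assumptions made up for this example, and real data would need rules tuned to the language and corpus.

```python
# Simplified stand-in for a rule-based de-tokenizer (NOT the OpenNLP one).
# The rule sets below are hypothetical and only cover a few common cases.
MERGE_TO_LEFT = {".", ",", ":", ";", "!", "?", ")"}   # attach to the previous token
MERGE_TO_RIGHT = {"("}                                # attach to the next token

SPLIT = "<SPLIT>"


def to_training_line(tokens):
    """Join tokens into one training line.

    Token boundaries that have no whitespace in the de-tokenized text are
    marked with <SPLIT>; all other boundaries become a single space.
    """
    parts = []
    for i, token in enumerate(tokens):
        if i > 0:
            prev = tokens[i - 1]
            glued = token in MERGE_TO_LEFT or prev in MERGE_TO_RIGHT
            parts.append(SPLIT if glued else " ")
        parts.append(token)
    return "".join(parts)


if __name__ == "__main__":
    # One sentence per line, as expected in the tokenizer training data.
    print(to_training_line(["He", "said", ":", "stop", "(", "now", ")", "."]))
```

Tokens separated by plain whitespace get no tag at all; only the boundaries inside a whitespace-separated string are marked.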

Another option is to use a Penn Treebank tokenizer; the cTAKES people are doing that and have one for UIMA. They might contribute it to OpenNLP one day.

Jörn
