Do you have the original sources for the text you have?
Or is that all there is?

If you had the original text, you could start by building a sentence detector, trained on a file with one sentence per line. Then take the same sentences for the tokenizer training file, adding a <SPLIT> tag wherever adjoining characters need to be split into two tokens.

For example, a line in English may be:
    "Today is the day all good men say, a good thing."

It would be annotated in the tokenizer training file as:
    "Today is the day all good men say<SPLIT>, a good thing<SPLIT>."

This way, when new data is parsed with the model you built, it can generate tokenized output like this:
    " Today is the day all good men say , a good thing . "

Here, each token is separated by a space in the final output. What you seem to have is data that is already tokenized, and you are trying to generate a training file from that data. It isn't impossible, but nothing you do can recover a perfect copy of the original without the original data.
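
For what it's worth, producing output like the line above from new text (one sentence per line on stdin) is just:

    $ opennlp TokenizerME en-token.bin < sentences.txt > tokenized.txt

(en-token.bin and the file names are placeholders for whatever you trained above.)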

There are some rules that do work, but... not always.
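
As a rough illustration (plain Java I wrote for this mail, not an OpenNLP API), a typical rule glues sentence punctuation back onto the preceding token and marks the join with <SPLIT>. It handles the example above, but abbreviations, numbers, and opening quotes are exactly where the "not always" comes in:

    // Sketch only: attach common punctuation to the previous token with <SPLIT>.
    public class SplitAnnotator {
        private static final String ATTACH_LEFT = ".,;:!?)";

        public static String annotate(String tokenizedLine) {
            StringBuilder sb = new StringBuilder();
            for (String token : tokenizedLine.trim().split("\\s+")) {
                if (sb.length() == 0) {
                    sb.append(token);
                } else if (token.length() == 1
                        && ATTACH_LEFT.indexOf(token.charAt(0)) >= 0) {
                    // no whitespace before this token in the original text
                    sb.append("<SPLIT>").append(token);
                } else {
                    sb.append(' ').append(token);
                }
            }
            return sb.toString();
        }

        public static void main(String[] args) {
            // Prints: Today is the day all good men say<SPLIT>, a good thing<SPLIT>.
            System.out.println(annotate(
                "Today is the day all good men say , a good thing ."));
        }
    }

An opening quote or parenthesis would need the <SPLIT> on the other side, and a period after an abbreviation should not be split at all, so a rule set like this will always leak some errors into the training data.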

James

On 3/14/2013 1:16 PM, Andreas Niekler wrote:
Maybe just a stupid idea, but is it not possible to just use my
whitespace training data and add one <SPLIT> tag somewhere where it
makes sense? The tokenizer just needs the feature, and all the
separations are already made. Abbreviations are not separated in that
file, so it should learn those examples without any further annotation.

But I'm not sure.



On 3/14/2013 2:50 PM, Jörn Kottmann wrote:
On 03/14/2013 02:15 PM, Andreas Niekler wrote:
Hello,

It seems that this issue was already opened by you:
https://issues.apache.org/jira/browse/OPENNLP-501

Should I include that in 1.6.0 or just the trunk?
Leave the version open; it would probably be nice to pull that
fix into 1.5.3, but it depends on how quickly we get it and what
the other committers think about it, so I can't promise anything here.
If it does not go into 1.5.3, it will definitely go into the version after.

Jörn
