Do you have the original sources for the text you have?
Or is that all there is?

If you had the original text, you could start by building a sentence detector, trained on a file with one sentence per line. Then take the same sentences for the tokenizer training file, adding a <SPLIT> tag wherever adjoining characters need to be split into two tokens.

For example, a line in English may be:
    "Today is the day all good men say, a good thing."

It would be annotated in the tokenizer training file as:
    "Today is the day all good men say<SPLIT>, a good thing<SPLIT>."

This way, when new data is parsed with the model you built, it can generate tokenized output like this:
    " Today is the day all good men say , a good thing . "

Here, each token is separated by a space in the final output. What you seem to have is data that is already tokenized, and you are trying to generate a training file from that data. It isn't impossible, but nothing you do can recover a perfect copy of the original without the original data.
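
For what it's worth, producing output like the line above from new text (one sentence per line on stdin) is just:

    $ opennlp TokenizerME en-token.bin < sentences.txt > tokenized.txt

(en-token.bin and the file names are placeholders for whatever you trained above.)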

There are some rules that do work, but... not always.
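
As a rough illustration (plain Java I wrote for this mail, not an OpenNLP API), a typical rule glues sentence punctuation back onto the preceding token and marks the join with <SPLIT>. It handles the example above, but abbreviations, numbers, and opening quotes are exactly where the "not always" comes in:

    // Sketch only: attach common punctuation to the previous token with <SPLIT>.
    public class SplitAnnotator {
        private static final String ATTACH_LEFT = ".,;:!?)";

        public static String annotate(String tokenizedLine) {
            StringBuilder sb = new StringBuilder();
            for (String token : tokenizedLine.trim().split("\\s+")) {
                if (sb.length() == 0) {
                    sb.append(token);
                } else if (token.length() == 1
                        && ATTACH_LEFT.indexOf(token.charAt(0)) >= 0) {
                    // no whitespace before this token in the original text
                    sb.append("<SPLIT>").append(token);
                } else {
                    sb.append(' ').append(token);
                }
            }
            return sb.toString();
        }

        public static void main(String[] args) {
            // Prints: Today is the day all good men say<SPLIT>, a good thing<SPLIT>.
            System.out.println(annotate(
                "Today is the day all good men say , a good thing ."));
        }
    }

An opening quote or parenthesis would need the <SPLIT> on the other side, and a period after an abbreviation should not be split at all, so a rule set like this will always leak some errors into the training data.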

James

On 3/14/2013 1:16 PM, Andreas Niekler wrote:
Maybe just a stupid idea, but is it not possible to just use my
whitespace training data and add one <SPLIT> tag somewhere where it
makes sense? The tokenizer just needs the feature, and all the
separations are already made. Abbreviations are not separated in that
file, so it should learn those examples without any further annotation.

But I'm not sure.



On 3/14/2013 2:50 PM, Jörn Kottmann wrote:
On 03/14/2013 02:15 PM, Andreas Niekler wrote:
Hello,

It seems that this issue was already opened by you:
https://issues.apache.org/jira/browse/OPENNLP-501

Should I include that in 1.6.0 or just the trunk?
Leave the version open; it would probably be nice to pull that
fix into 1.5.3, but it depends on how quickly we get it and what
the other committers think about it, so I can't promise anything here.
If it does not go into 1.5.3, it will definitely go into the version after.

Jörn
