Do you have the original sources for the text you have?
Or is that it?
If you had the original text, you could start by building a sentence
detector, training it on a file with one sentence per line. Then use the
same sentences for the tokenizer training file, just adding <SPLIT>
wherever one string needs to be split into two tokens.
For example, a line in English might be:
"Today is the day all good men say, a good thing."
This would be tokenized for the training file as:
"<SPLIT>Today is the day all good men say<SPLIT>, a good
thing<SPLIT>.<SPLIT>"
This way, when new data is parsed through the model you built, it can
generate tokenized output like this:
" Today is the day all good men say , a good thing . "
Here, each token is separated by a space in the final output.
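Producing that space-separated form from a trained model might look
roughly like this; TokenizerModel and TokenizerME are the standard
OpenNLP classes, while the model file name is just a placeholder:

import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

public class TokenizeDemo {
    public static void main(String[] args) throws Exception {
        // "en-token.bin" stands in for whatever model you trained.
        try (InputStream in = new FileInputStream("en-token.bin")) {
            TokenizerModel model = new TokenizerModel(in);
            TokenizerME tokenizer = new TokenizerME(model);
            String[] tokens = tokenizer.tokenize(
                    "\"Today is the day all good men say, a good thing.\"");
            // Join the tokens with single spaces, as in the example output.
            System.out.println(String.join(" ", tokens));
        }
    }
}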
What you seem to have is data that is already tokenized, and you are
trying to generate a training file from that data. It isn't impossible,
but nothing you do can reproduce the original perfectly without the
original data.
There are some detokenization rules that do work, but... not always.
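As one illustration, the sketch below re-attaches punctuation to the
token on its left; it handles the sentence above but falls over on
things like straight quotes, which would need open/close tracking:

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SimpleDetokenizer {

    private static final Set<String> ATTACH_LEFT =
            new HashSet<>(Arrays.asList(",", ".", ";", ":", "!", "?", ")", "]"));
    private static final Set<String> ATTACH_RIGHT =
            new HashSet<>(Arrays.asList("(", "["));

    public static String detokenize(List<String> tokens) {
        StringBuilder out = new StringBuilder();
        boolean noSpaceBefore = true; // never a space before the first token
        for (String tok : tokens) {
            if (!noSpaceBefore && !ATTACH_LEFT.contains(tok)) {
                out.append(' ');
            }
            out.append(tok);
            noSpaceBefore = ATTACH_RIGHT.contains(tok);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints: Today is the day all good men say, a good thing.
        System.out.println(detokenize(Arrays.asList("Today", "is", "the",
                "day", "all", "good", "men", "say", ",", "a", "good",
                "thing", ".")));
    }
}

Depending on your version, OpenNLP also ships a Detokenizer API that
encodes rules of this kind, which may be worth checking.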
James
On 3/14/2013 1:16 PM, Andreas Niekler wrote:
Maybe just a stupid idea, but is it not possible to just use my
whitespace training data and add a <SPLIT> tag wherever it makes sense?
The tokenizer just needs the feature, and all the separations are
already made. Abbreviations are not separated in that file, so it
should learn those examples without any further annotation.
But I'm not sure.
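A minimal sketch of that conversion, assuming a split "makes sense"
next to punctuation (the SplitTagger class and the punctuation list are
assumptions, not part of OpenNLP):

public class SplitTagger {

    public static String tagLine(String tokenizedLine) {
        String[] tokens = tokenizedLine.trim().split("\\s+");
        StringBuilder out = new StringBuilder(tokens[0]);
        for (int i = 1; i < tokens.length; i++) {
            // Punctuation normally sits flush against the previous word,
            // so mark it as a split point rather than a whitespace gap.
            if (tokens[i].matches("[,.;:!?]")) {
                out.append("<SPLIT>");
            } else {
                out.append(' ');
            }
            out.append(tokens[i]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints: Today is the day all good men say<SPLIT>, a good thing<SPLIT>.
        System.out.println(tagLine(
                "Today is the day all good men say , a good thing ."));
    }
}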
On 14.03.2013 14:50, Jörn Kottmann wrote:
On 03/14/2013 02:15 PM, Andreas Niekler wrote:
Hello,
It seems that this issue was already opened by you:
https://issues.apache.org/jira/browse/OPENNLP-501
Should I include that in 1.6.0 or just the trunk?
Leave the version open. It would probably be nice to pull that
fix into 1.5.3, but it depends on how quickly we get it and what
the other committers think about it, so I can't promise anything here.
If it does not go into 1.5.3, it will definitely go into the version after.
Jörn