Have a look here:
http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.tokenizer.detokenizing
Here is the detokenizer tool:
https://github.com/apache/opennlp/blob/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/tokenizer/DictionaryDetokenizerTool.java
Looks like it doesn't output the <SPLIT> tag; we should change that. Its
main purpose is to generate training data for the tokenizer. Anyway,
patches to improve the detokenizer are very welcome, and it looks like
the documentation needs a few fixes too.
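A rough, untested sketch of what that change could look like (the class
below is made up, and the Detokenizer API details are from memory, so
please double-check against trunk):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;

import opennlp.tools.tokenize.DetokenizationDictionary;
import opennlp.tools.tokenize.Detokenizer;
import opennlp.tools.tokenize.Detokenizer.DetokenizationOperation;
import opennlp.tools.tokenize.DictionaryDetokenizer;

// Hypothetical stand-alone sketch, not the actual tool class.
public class SplitTagOutputSketch {

  // Joins the tokens of one sentence, writing <SPLIT> instead of a space
  // between two tokens which the detokenizer says belong together.
  static String toTrainingLine(String[] tokens, DetokenizationOperation[] ops) {
    StringBuilder line = new StringBuilder(tokens[0]);
    for (int i = 1; i < tokens.length; i++) {
      boolean attachLeft = ops[i] == DetokenizationOperation.MERGE_TO_LEFT
          || ops[i] == DetokenizationOperation.MERGE_BOTH;
      boolean attachRight = ops[i - 1] == DetokenizationOperation.MERGE_TO_RIGHT
          || ops[i - 1] == DetokenizationOperation.MERGE_BOTH;
      line.append(attachLeft || attachRight ? "<SPLIT>" : " ");
      line.append(tokens[i]);
    }
    return line.toString();
  }

  public static void main(String[] args) throws Exception {
    // args[0]: path to a detokenizer dictionary xml
    InputStream dictIn = new FileInputStream(args[0]);
    Detokenizer detokenizer =
        new DictionaryDetokenizer(new DetokenizationDictionary(dictIn));
    dictIn.close();

    // Read whitespace tokenized sentences, one per line, like the tool does.
    BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
    String line;
    while ((line = in.readLine()) != null) {
      if (line.trim().length() == 0) {
        System.out.println();
        continue;
      }
      String[] tokens = line.trim().split("\\s+");
      // e.g. "AKTIEN SCHWEIZ/Verlauf :" -> "AKTIEN SCHWEIZ/Verlauf<SPLIT>:"
      // (assuming the dictionary has a MERGE_TO_LEFT rule for ":")
      System.out.println(toTrainingLine(tokens, detokenizer.detokenize(tokens)));
    }
  }
}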
HTH,
Jörn
On 03/14/2013 01:32 PM, Andreas Niekler wrote:
Hello,
OK, I will find out what the name of the tool is, and I will create a
rules XML and an abbreviations list (not sure about the format here
either, but I hope I can find an example).
Are you interested in hosting the model after I finally succeed?
Thank you very much
Andreas
Am 14.03.2013 13:25, schrieb Jörn Kottmann:
On 03/14/2013 12:20 PM, Andreas Niekler wrote:
So the detokenizer adds the <SPLIT> tag where it is needed?
Exactly, you need to merge those tokens again which were previously not
separated by white space. E.g. the original text "AKTIEN SCHWEIZ/Verlauf:"
is tokenized as "AKTIEN SCHWEIZ/Verlauf :", and in the training data you
encode that as "AKTIEN SCHWEIZ/Verlauf<SPLIT>:".
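Just for reference, that <SPLIT> encoding is the same one the tokenizer
training format uses, and if I remember the API correctly a TokenSample
can parse such a line back into the original text plus token spans:

import opennlp.tools.tokenize.TokenSample;
import opennlp.tools.util.Span;

public class ParseSplitLineExample {
  public static void main(String[] args) {
    // "<SPLIT>" is the separator used in the tokenizer training data
    TokenSample sample =
        TokenSample.parse("AKTIEN SCHWEIZ/Verlauf<SPLIT>:", "<SPLIT>");

    System.out.println(sample.getText()); // prints: AKTIEN SCHWEIZ/Verlauf:
    for (Span span : sample.getTokenSpans()) {
      System.out.println(span.getCoveredText(sample.getText()));
    }
  }
}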
The detokenizer just figures out, based on some rules, which tokens are
merged together and which are not. There is a util which can use that
information to output the tokenizer training data; it should be
integrated into the CLI, but it's a while since I last used it.
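If I remember correctly that util is more or less TokenSample itself: it
has a constructor which takes a Detokenizer plus the tokens, and printing
the sample gives the <SPLIT> encoded line. Untested sketch, constructor
and behavior from memory:

import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.tokenize.DetokenizationDictionary;
import opennlp.tools.tokenize.Detokenizer;
import opennlp.tools.tokenize.DictionaryDetokenizer;
import opennlp.tools.tokenize.TokenSample;

public class TokenSampleFromTokens {
  public static void main(String[] args) throws Exception {
    // args[0]: path to a detokenizer rules xml
    InputStream dictIn = new FileInputStream(args[0]);
    Detokenizer detokenizer =
        new DictionaryDetokenizer(new DetokenizationDictionary(dictIn));
    dictIn.close();

    // The detokenizer decides which of these tokens get merged again;
    // the sample should then print as "AKTIEN SCHWEIZ/Verlauf<SPLIT>:"
    // if there is a rule for ":".
    String[] tokens = {"AKTIEN", "SCHWEIZ/Verlauf", ":"};
    TokenSample sample = new TokenSample(detokenizer, tokens);
    System.out.println(sample);
  }
}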
Don't hesitate to ask if you need more help,
Jörn