On 03/14/2013 12:20 PM, Andreas Niekler wrote:
So the detokenizer adds the <SPLIT> tag where it is needed?
Exactly. You need to merge the tokens again that were not separated by whitespace in the original text. E.g. "SCHWEIZ/Verlauf :" was "AKTIEN SCHWEIZ/Verlauf:" in the original text, and in the training data you encode that as "AKTIEN SCHWEIZ/Verlauf<SPLIT>:".
The detokenizer just figures out, based on a set of rules, which tokens are merged together and which are not. There is a util which can use that information to output the tokenizer training data. It should be integrated into the CLI, but it's been a while since I last used it.
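For illustration, a minimal sketch of how this could look with the dictionary-based detokenizer (this assumes the 1.5-style API; the MERGE_TO_LEFT rule for ":" is made up here for the example and would normally come from a detokenizer dictionary file):

import opennlp.tools.tokenize.DetokenizationDictionary;
import opennlp.tools.tokenize.DetokenizationDictionary.Operation;
import opennlp.tools.tokenize.Detokenizer;
import opennlp.tools.tokenize.DictionaryDetokenizer;

public class SplitTagSketch {

    public static void main(String[] args) {
        // Assumed rule: ":" always attaches to the token on its left.
        DetokenizationDictionary dict = new DetokenizationDictionary(
                new String[] { ":" },
                new Operation[] { Operation.MERGE_TO_LEFT });

        Detokenizer detokenizer = new DictionaryDetokenizer(dict);

        String[] tokens = { "AKTIEN", "SCHWEIZ/Verlauf", ":" };

        // Inserting the split marker wherever two tokens are merged
        // produces the tokenizer training data format; this should print:
        // AKTIEN SCHWEIZ/Verlauf<SPLIT>:
        System.out.println(detokenizer.detokenize(tokens, "<SPLIT>"));
    }
}

Passing null instead of "<SPLIT>" would give you back the plain detokenized text ("AKTIEN SCHWEIZ/Verlauf:").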
Don't hesitate to ask if you need more help,

Jörn
