On 03/14/2013 12:20 PM, Andreas Niekler wrote:
So the detokenizer adds the <SPLIT> tag where it is needed?
Exactly. You need to merge the tokens again that were not separated by whitespace in the original text. E.g. "SCHWEIZ/Verlauf :" was "AKTIEN SCHWEIZ/Verlauf:" in the original text, and in the training data you encode that as "AKTIEN SCHWEIZ/Verlauf<SPLIT>:".
The detokenizer just figures out, based on a set of rules, which tokens are merged together and which are not. There is a util which can use that information to output the tokenizer training data. It should be integrated into the CLI, but it's been a while since I last used it.
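For illustration, a minimal sketch of how this could look with the dictionary-based detokenizer (this assumes the 1.5-style API; the MERGE_TO_LEFT rule for ":" is made up here for the example and would normally come from a detokenizer dictionary file):

import opennlp.tools.tokenize.DetokenizationDictionary;
import opennlp.tools.tokenize.DetokenizationDictionary.Operation;
import opennlp.tools.tokenize.Detokenizer;
import opennlp.tools.tokenize.DictionaryDetokenizer;

public class SplitTagSketch {

    public static void main(String[] args) {
        // Assumed rule: ":" always attaches to the token on its left.
        DetokenizationDictionary dict = new DetokenizationDictionary(
                new String[] { ":" },
                new Operation[] { Operation.MERGE_TO_LEFT });

        Detokenizer detokenizer = new DictionaryDetokenizer(dict);

        String[] tokens = { "AKTIEN", "SCHWEIZ/Verlauf", ":" };

        // Inserting the split marker wherever two tokens are merged
        // produces the tokenizer training data format; this should print:
        // AKTIEN SCHWEIZ/Verlauf<SPLIT>:
        System.out.println(detokenizer.detokenize(tokens, "<SPLIT>"));
    }
}

Passing null instead of "<SPLIT>" would give you back the plain detokenized text ("AKTIEN SCHWEIZ/Verlauf:").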
Don't hesitate to ask if you need more help,

Jörn
