Hello,

> We probably need to fix the detokenizer rules used for the German models
> a bit to handle these cases correctly.

Are those rules public somewhere so that I can edit them myself? I can provide them to the community afterwards. The main problem is that characters like „ and “ are not recognized by the tokenizer. I don't want to convert them before tokenizing, because we analyze things like direct speech and those characters are a good indicator for that.
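To illustrate why I want to keep the original characters, here is a minimal sketch (only an illustration, not part of our pipeline; the function name and regex are just examples): the „…“ pair delimits direct speech directly, so normalizing it away before tokenization would lose that signal.

# The German quotation marks „ and “ delimit direct speech, so they need
# to survive tokenization instead of being converted to ASCII quotes.
import re

def direct_speech_spans(text):
    """Return the passages enclosed in German quotation marks „...“."""
    return re.findall(r"„([^“]*)“", text)

print(direct_speech_spans('Er sagte: „Das ist ein Test.“ Danach ging er weiter.'))
# -> ['Das ist ein Test.']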
> I suggest to use our detokenizer to turn your tokenized text into
> training data.

Does the detokenizer have a command line tool as well?

Thank you all
Andreas

--
Andreas Niekler, Dipl. Ing. (FH)
NLP Group | Department of Computer Science
University of Leipzig
Johannisgasse 26 | 04103 Leipzig

mail: [email protected]
