Hello, OK, I will find out what the name of the tool is, and I will create a rules XML and an abbreviations list (I'm not sure about the format here either, but I hope I can find an example).
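For reference, a minimal sketch of what such a rules XML might look like, assuming the DetokenizationDictionary format that OpenNLP ships as latin-detokenizer.xml (the element names and the set of operations should be verified against that file):

<?xml version="1.0" encoding="UTF-8"?>
<dictionary>
  <!-- "(" attaches to the token on its right: "( foo" becomes "(foo" -->
  <entry operation="MOVE_RIGHT">
    <token>(</token>
  </entry>
  <!-- ")", "." and ":" attach to the token on their left -->
  <entry operation="MOVE_LEFT">
    <token>)</token>
  </entry>
  <entry operation="MOVE_LEFT">
    <token>.</token>
  </entry>
  <entry operation="MOVE_LEFT">
    <token>:</token>
  </entry>
  <!-- quote characters alternate between attaching right and left -->
  <entry operation="RIGHT_LEFT_MATCHING">
    <token>"</token>
  </entry>
</dictionary>

With a MOVE_LEFT rule for ":", the detokenizer should merge "SCHWEIZ/Verlauf :" back into "SCHWEIZ/Verlauf:", which is exactly the case from Jörn's example below.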
Are you interested in hosting the model once I finally succeed?

Thank you very much,
Andreas

On 14.03.2013 13:25, Jörn Kottmann wrote:
> On 03/14/2013 12:20 PM, Andreas Niekler wrote:
>> So the detokenizer adds the <SPLIT> tag where it is needed?
>
> Exactly. You need to merge again the tokens which were previously not
> separated by a white space. E.g. "SCHWEIZ/Verlauf :" was in the original
> text "AKTIEN SCHWEIZ/Verlauf:", and in the training data you encode that
> as "AKTIEN SCHWEIZ/Verlauf<SPLIT>:".
>
> The detokenizer just figures out, based on some rules, which tokens are
> merged together and which are not. There is a util which can use that
> information to output the tokenizer training data; it should be
> integrated into the CLI, but it's been a while since I last used it.
>
> Don't hesitate to ask if you need more help,
> Jörn

--
Andreas Niekler, Dipl. Ing. (FH)
NLP Group | Department of Computer Science
University of Leipzig
Johannisgasse 26 | 04103 Leipzig
mail: [email protected]
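The util Jörn mentions is presumably the TokenSample constructor that accepts a Detokenizer. A minimal sketch of the detokenizer-to-training-data step, assuming the OpenNLP 1.5.x classes DictionaryDetokenizer, DetokenizationDictionary and TokenSample (names and behavior should be double-checked against the release in use):

import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.tokenize.DetokenizationDictionary;
import opennlp.tools.tokenize.Detokenizer;
import opennlp.tools.tokenize.DictionaryDetokenizer;
import opennlp.tools.tokenize.TokenSample;

public class MakeTokenizerTrainingData {

  public static void main(String[] args) throws Exception {
    // Load the detokenizer rules XML (see the dictionary sketch above).
    try (InputStream in = new FileInputStream(args[0])) {
      Detokenizer detokenizer =
          new DictionaryDetokenizer(new DetokenizationDictionary(in));

      // One sentence that is already whitespace-tokenized, e.g. taken
      // from an existing tokenized corpus.
      String[] tokens = {"AKTIEN", "SCHWEIZ/Verlauf", ":"};

      // TokenSample uses the detokenizer's merge decisions to work out
      // which adjacent tokens were not separated by a space in the
      // original text.
      TokenSample sample = new TokenSample(detokenizer, tokens);

      // toString() should emit the tokenizer training format, with
      // merged tokens joined by <SPLIT>:
      //   AKTIEN SCHWEIZ/Verlauf<SPLIT>:
      System.out.println(sample);
    }
  }
}

Looping this over every sentence of a tokenized corpus should yield input suitable for the TokenizerTrainer CLI.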
