Hello, OK, I will find out what the name of the tool is, and I will create a rules XML and an abbreviations list (I'm not sure about the format here either, but I hope I can find an example).
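For reference, a minimal sketch of what such a rules XML might look like, assuming the DetokenizationDictionary format that OpenNLP ships as latin-detokenizer.xml (the element names and the set of operations should be verified against that file):

<?xml version="1.0" encoding="UTF-8"?>
<dictionary>
  <!-- "(" attaches to the token on its right: "( foo" becomes "(foo" -->
  <entry operation="MOVE_RIGHT">
    <token>(</token>
  </entry>
  <!-- ")", "." and ":" attach to the token on their left -->
  <entry operation="MOVE_LEFT">
    <token>)</token>
  </entry>
  <entry operation="MOVE_LEFT">
    <token>.</token>
  </entry>
  <entry operation="MOVE_LEFT">
    <token>:</token>
  </entry>
  <!-- quote characters alternate between attaching right and left -->
  <entry operation="RIGHT_LEFT_MATCHING">
    <token>"</token>
  </entry>
</dictionary>

With a MOVE_LEFT rule for ":", the detokenizer should merge "SCHWEIZ/Verlauf :" back into "SCHWEIZ/Verlauf:", which is exactly the case from Jörn's example below.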
Are you interested in hosting the model once I finally succeed?

Thank you very much,
Andreas

On 14.03.2013 13:25, Jörn Kottmann wrote:
> On 03/14/2013 12:20 PM, Andreas Niekler wrote:
>> So the detokenizer adds the <SPLIT> tag where it is needed?
>
> Exactly. You need to merge again the tokens which were previously not
> separated by a white space. E.g. "SCHWEIZ/Verlauf :" was in the original
> text "AKTIEN SCHWEIZ/Verlauf:", and in the training data you encode that
> as "AKTIEN SCHWEIZ/Verlauf<SPLIT>:".
>
> The detokenizer just figures out, based on some rules, which tokens are
> merged together and which are not. There is a util which can use that
> information to output the tokenizer training data; it should be
> integrated into the CLI, but it's been a while since I last used it.
>
> Don't hesitate to ask if you need more help,
> Jörn

--
Andreas Niekler, Dipl. Ing. (FH)
NLP Group | Department of Computer Science
University of Leipzig
Johannisgasse 26 | 04103 Leipzig
mail: [email protected]
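The util Jörn mentions is presumably the TokenSample constructor that accepts a Detokenizer. A minimal sketch of the detokenizer-to-training-data step, assuming the OpenNLP 1.5.x classes DictionaryDetokenizer, DetokenizationDictionary and TokenSample (names and behavior should be double-checked against the release in use):

import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.tokenize.DetokenizationDictionary;
import opennlp.tools.tokenize.Detokenizer;
import opennlp.tools.tokenize.DictionaryDetokenizer;
import opennlp.tools.tokenize.TokenSample;

public class MakeTokenizerTrainingData {

  public static void main(String[] args) throws Exception {
    // Load the detokenizer rules XML (see the dictionary sketch above).
    try (InputStream in = new FileInputStream(args[0])) {
      Detokenizer detokenizer =
          new DictionaryDetokenizer(new DetokenizationDictionary(in));

      // One sentence that is already whitespace-tokenized, e.g. taken
      // from an existing tokenized corpus.
      String[] tokens = {"AKTIEN", "SCHWEIZ/Verlauf", ":"};

      // TokenSample uses the detokenizer's merge decisions to work out
      // which adjacent tokens were not separated by a space in the
      // original text.
      TokenSample sample = new TokenSample(detokenizer, tokens);

      // toString() should emit the tokenizer training format, with
      // merged tokens joined by <SPLIT>:
      //   AKTIEN SCHWEIZ/Verlauf<SPLIT>:
      System.out.println(sample);
    }
  }
}

Looping this over every sentence of a tokenized corpus should yield input suitable for the TokenizerTrainer CLI.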
