On 04/02/2013 06:02 PM, Benson Margulies wrote:
It seems to me to be an invariant that the training and runtime environments have to agree on the input. In this case, it's a matter of agreeing on the text normalization (in the Unicode sense) and the tokenization. I doubt that it is viable to construct a model and runtime that adapt to some disparate collection of possible normalizations and tokenizations.
I didn't use "normalization" in the Unicode sense here. Some of the corpora we use (e.g. the Penn Treebank) are unified to use only certain tokens for quotes, brackets, etc., and the same unifications should be applied in the runtime environment as well. We currently have no tool in OpenNLP that can do this for the user, and I propose that we add one.

Jörn
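
P.S. To make the proposal concrete, here is a rough sketch of what such a unification step might look like when applied to tokens at runtime. The class name PtbTokenNormalizer and its normalize method are hypothetical and not part of OpenNLP. The mapping follows the Penn Treebank escape conventions (-LRB-/-RRB- and friends for brackets, `` and '' for double quotes), and treating straight double quotes as alternating open/close is a simplifying assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not an existing OpenNLP class: maps runtime tokens
// onto the token forms used in Penn Treebank style training data.
class PtbTokenNormalizer {

    // Penn Treebank escape tokens for brackets.
    private static final Map<String, String> BRACKETS = new HashMap<>();
    static {
        BRACKETS.put("(", "-LRB-");
        BRACKETS.put(")", "-RRB-");
        BRACKETS.put("[", "-LSB-");
        BRACKETS.put("]", "-RSB-");
        BRACKETS.put("{", "-LCB-");
        BRACKETS.put("}", "-RCB-");
    }

    // Returns a new array; the input tokens are left untouched.
    static String[] normalize(String[] tokens) {
        String[] out = new String[tokens.length];
        boolean insideQuote = false; // assumption: straight quotes alternate open/close
        for (int i = 0; i < tokens.length; i++) {
            String t = tokens[i];
            if (BRACKETS.containsKey(t)) {
                out[i] = BRACKETS.get(t);
            } else if (t.equals("\u201C")) {        // left curly double quote
                out[i] = "``";
            } else if (t.equals("\u201D")) {        // right curly double quote
                out[i] = "''";
            } else if (t.equals("\"")) {            // straight double quote
                out[i] = insideQuote ? "''" : "``";
                insideQuote = !insideQuote;
            } else {
                out[i] = t;                         // everything else passes through
            }
        }
        return out;
    }
}
```

The idea would be to run this on the tokenizer output before handing the tokens to a model trained on such a corpus, so that training and runtime agree on the input. A real tool would need one mapping per corpus convention rather than a single hard-coded table.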
