It seems to me that it's an invariant: the training and runtime environments have to agree on the input. In this case, that means agreeing on the text normalization (in the Unicode sense) and on the tokenization. I doubt it is viable to construct a model and runtime that adapt to some disparate collection of possible normalizations and tokenizations.
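
For example, a normalization step along these lines (an untested sketch; the quote mappings below are illustrative, not exhaustive) could be applied identically to the training corpus and to runtime input, so both sides see the same characters:

import java.text.Normalizer;

public class InputNormalizer {

    // Fold text to Unicode NFC and map typographic quotes to their ASCII
    // equivalents, so training data and runtime input agree.
    public static String normalize(String text) {
        String nfc = Normalizer.normalize(text, Normalizer.Form.NFC);
        return nfc
                .replace('\u201C', '"')   // left double quotation mark
                .replace('\u201D', '"')   // right double quotation mark
                .replace('\u2018', '\'')  // left single quotation mark
                .replace('\u2019', '\''); // right single quotation mark
    }

    public static void main(String[] args) {
        System.out.println(normalize("The \u201Cquick\u201D brown fox jumps over the lazy dog."));
        // prints: The "quick" brown fox jumps over the lazy dog.
    }
}
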
On Tue, Apr 2, 2013 at 11:50 AM, Lance Norskog <[email protected]> wrote:
> Lucene has solved this problem with more sophisticated tools than OpenNLP
> has, and with custom support for many languages.
>
> On 04/02/2013 01:37 AM, Jörn Kottmann wrote:
>>
>> This seems to be a common problem with input text to our models. I think
>> we should add a normalization tool which a user can run beforehand, or which
>> we maybe even integrate into our models. It's probably better for the user if
>> it is integrated into the model, because he usually doesn't know the
>> specifics of the corpus the model was trained on.
>>
>> Any thoughts? Should we open a JIRA for it?
>>
>> Jörn
>>
>> On 03/29/2013 03:22 AM, William Colen wrote:
>>>
>>> In my opinion the tokenizer is working properly and the issue is with the
>>> quotes, which are unknown to the parser model. I would preprocess the
>>> tokenized text, replacing the quotes with the ones known by the model,
>>> which follow the treebank convention.
>>>
>>> On Thu, Mar 28, 2013 at 11:13 PM, James Kosin <[email protected]>
>>> wrote:
>>>
>>>> On 3/28/2013 9:54 AM, Ian Jackson wrote:
>>>>
>>>>> I used the prebuilt models for the SentenceModel (en-sent.bin),
>>>>> TokenizerModel (en-token.bin), and ParserModel (en-parser-chunker.bin)
>>>>> with the following sentence:
>>>>> The "quick" brown fox jumps over the lazy dog.
>>>>>
>>>>> The result marks the part of speech for the quotes as JJ (for the open)
>>>>> and NN (for the close) as follows:
>>>>> (TOP (NP (NP (DT The) (JJ ") (JJ quick) (NN ") (JJ brown) (NN fox)
>>>>> (NNS jumps)) (PP (IN over) (NP (DT the) (JJ lazy) (NN dog))) (. .)))
>>>>>
>>>>> If I alter the sentence as follows, changing the double quotes to two
>>>>> single forward quotes and backward quotes
>>>>> [http://www.cis.upenn.edu/~treebank/tokenization.html]:
>>>>> The `` quick '' brown fox jumps over the lazy dog
>>>>>
>>>>> The results are as follows:
>>>>> (TOP (NP (NP (DT The) (`` ``) (JJ quick) ('' '') (JJ brown) (NN fox)
>>>>> (NNS jumps)) (PP (IN over) (NP (DT the) (JJ lazy) (NN dog))) (. .)))
>>>>>
>>>>> Does a method exist to configure the tokenizer to handle quotes within
>>>>> a sentence?
>>>>>
>>>> Training the models with the double quotes instead of the single
>>>> forward/backward quotes would do the trick.
>>>> Would explain why the tokenizer model doesn't do well with my
>>>> sentences...
>>>> I've had to train my own models for a lot of the stuff I'm doing these
>>>> days.
>>>>
>>>> Thanks,
>>>> James
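
To make William's suggestion concrete, here is a rough, untested sketch against the 1.5-style API: tokenize with en-token.bin, replace the straight double-quote tokens with the `` and '' forms the pretrained en-parser-chunker.bin model expects, and hand the space-joined tokens to the parser. The open/close alternation below is a deliberately naive heuristic.

import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.cmdline.parser.ParserTool;
import opennlp.tools.parser.Parse;
import opennlp.tools.parser.Parser;
import opennlp.tools.parser.ParserFactory;
import opennlp.tools.parser.ParserModel;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

public class TreebankQuoteDemo {

    public static void main(String[] args) throws Exception {
        try (InputStream tokIn = new FileInputStream("en-token.bin");
             InputStream parserIn = new FileInputStream("en-parser-chunker.bin")) {

            TokenizerME tokenizer = new TokenizerME(new TokenizerModel(tokIn));
            Parser parser = ParserFactory.create(new ParserModel(parserIn));

            String[] tokens = tokenizer.tokenize(
                    "The \"quick\" brown fox jumps over the lazy dog.");

            // Map straight double quotes onto the Penn Treebank tokens the
            // pretrained model was built on, alternating open/close.
            boolean open = true;
            StringBuilder line = new StringBuilder();
            for (String token : tokens) {
                if ("\"".equals(token)) {
                    token = open ? "``" : "''";
                    open = !open;
                }
                if (line.length() > 0) {
                    line.append(' ');
                }
                line.append(token);
            }

            // ParserTool.parseLine expects whitespace-separated tokens.
            Parse[] parses = ParserTool.parseLine(line.toString(), parser, 1);
            parses[0].show();
        }
    }
}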
