It seems to me to be an invariant that the training and runtime
environments have to agree on the input. In this case, it's a matter
of agreeing on the text normalization (in the Unicode sense) and the
tokenization. I doubt that it is viable to construct a model and
runtime that adapt to some disparate collection of possible
normalizations and tokenizations.
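
To make that concrete, here is a rough sketch (untested; the class and method
names are just placeholders, and the exact set of replacements depends on the
training corpus) of a shared normalization step applied both when the training
data is prepared and at runtime before tokenization:

    import java.text.Normalizer;

    public final class TextNormalization {

        private TextNormalization() {}

        // Run the same routine over the training corpus and over runtime
        // input, so the model never sees characters it was not trained on.
        public static String normalize(String text) {
            // Unicode normalization (NFC) so composed and decomposed forms agree.
            String s = Normalizer.normalize(text, Normalizer.Form.NFC);
            // Map typographic quotes to the plain ASCII forms seen in training.
            s = s.replace('\u201C', '"').replace('\u201D', '"');   // left/right double quotes
            s = s.replace('\u2018', '\'').replace('\u2019', '\''); // left/right single quotes
            return s;
        }
    }

The particular replacements matter less than the fact that exactly the same
code path runs on both sides.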

On Tue, Apr 2, 2013 at 11:50 AM, Lance Norskog <[email protected]> wrote:
> Lucene has solved this problem with more sophisticated tools than OpenNLP
> has, and with custom support for many languages.
>
> On 04/02/2013 01:37 AM, Jörn Kottmann wrote:
>>
>> This seems to be a common problem with input text to our models. I think
>> we should add a normalization tool which a user can run beforehand, or maybe
>> even integrate into our models. It's probably better for the user if it is
>> integrated into the model, because he usually doesn't know the specifics of
>> the corpus the model was trained on.
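
For illustration only (the class name is made up), such a stand-alone tool
could be little more than a stdin-to-stdout filter that users run ahead of the
OpenNLP command line tools:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.text.Normalizer;

    // Hypothetical normalization filter: read text on stdin, normalize it,
    // write it to stdout.
    public class NormalizeTool {
        public static void main(String[] args) throws Exception {
            BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8));
            for (String line = in.readLine(); line != null; line = in.readLine()) {
                // Unicode NFC here; any model-specific replacements (quotes,
                // dashes, etc.) would go in the same place.
                System.out.println(Normalizer.normalize(line, Normalizer.Form.NFC));
            }
        }
    }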
>>
>> Any thoughts? Should we open a jira for it?
>>
>> Jörn
>>
>> On 03/29/2013 03:22 AM, William Colen wrote:
>>>
>>> In my opinion the tokenizer is working properly and the issue is with the
>>> quotes, which are unknown to the parser model. I would preprocess the
>>> tokenized text, replacing the quotes with the ones known by the model,
>>> which follow the treebank convention.
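
Roughly the kind of replacement described above, as an untested sketch (the
class and method names are made up; it simply alternates open/close on each
straight double quote in an already-tokenized sentence):

    // Hypothetical helper: map straight double quotes in a tokenized sentence
    // to the Penn Treebank convention, `` for an opening quote and '' for a
    // closing one.
    public class TreebankQuotes {

        public static String[] toTreebankQuotes(String[] tokens) {
            String[] out = tokens.clone();
            boolean open = true;
            for (int i = 0; i < out.length; i++) {
                if ("\"".equals(out[i])) {
                    out[i] = open ? "``" : "''";
                    open = !open;
                }
            }
            return out;
        }

        public static void main(String[] args) {
            String[] tokens = {"The", "\"", "quick", "\"", "brown", "fox",
                               "jumps", "over", "the", "lazy", "dog", "."};
            for (String t : toTreebankQuotes(tokens)) {
                System.out.print(t + " ");
            }
            System.out.println();
        }
    }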
>>>
>>>
>>> On Thu, Mar 28, 2013 at 11:13 PM, James Kosin <[email protected]>
>>> wrote:
>>>
>>>> On 3/28/2013 9:54 AM, Ian Jackson wrote:
>>>>
>>>>> I used the prebuilt models for the SentenceModel (en-sent.bin),
>>>>> TokenizerModel (en-token.bin), and ParserModel (en-parser-chunker.bin)
>>>>> with
>>>>> the following sentence:
>>>>>      The "quick" brown fox jumps over the lazy dog.
>>>>>
>>>>> The result marks the part of speech for the quotes as JJ (for the open)
>>>>> and NN (for the close), as follows:
>>>>> (TOP (NP (NP (DT The) (JJ ") (JJ quick) (NN ") (JJ brown) (NN fox) (NNS
>>>>> jumps)) (PP (IN over) (NP (DT the) (JJ lazy) (NN dog))) (. .)))
>>>>>
>>>>> If I alter the sentence, changing the opening double quote to two
>>>>> backquotes and the closing double quote to two single quotes per the
>>>>> Treebank convention
>>>>> [http://www.cis.upenn.edu/~treebank/tokenization.html]:
>>>>>      The `` quick '' brown fox jumps over the lazy dog
>>>>>
>>>>> The results are as follows:
>>>>> (TOP (NP (NP (DT The) (`` ``) (JJ quick) ('' '') (JJ brown) (NN fox)
>>>>> (NNS jumps)) (PP (IN over) (NP (DT the) (JJ lazy) (NN dog))) (. .)))
>>>>>
>>>>> Does a method exist to configure the tokenizer to handle quotes within
>>>>> a sentence?
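
For reference, this is roughly how those three models get wired together (an
untested sketch; it assumes the model files named above sit in the working
directory, and the quote mapping discussed elsewhere in the thread would be
applied to the tokens before parsing):

    import java.io.FileInputStream;
    import java.io.InputStream;

    import opennlp.tools.cmdline.parser.ParserTool;
    import opennlp.tools.parser.Parse;
    import opennlp.tools.parser.Parser;
    import opennlp.tools.parser.ParserFactory;
    import opennlp.tools.parser.ParserModel;
    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;
    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;

    public class ParseExample {
        public static void main(String[] args) throws Exception {
            InputStream sentIn = new FileInputStream("en-sent.bin");
            InputStream tokIn = new FileInputStream("en-token.bin");
            InputStream parseIn = new FileInputStream("en-parser-chunker.bin");

            SentenceDetectorME sentenceDetector =
                new SentenceDetectorME(new SentenceModel(sentIn));
            TokenizerME tokenizer = new TokenizerME(new TokenizerModel(tokIn));
            Parser parser = ParserFactory.create(new ParserModel(parseIn));

            String text = "The \"quick\" brown fox jumps over the lazy dog.";
            for (String sentence : sentenceDetector.sentDetect(text)) {
                String[] tokens = tokenizer.tokenize(sentence);
                // ParserTool.parseLine expects whitespace-separated tokens;
                // any quote normalization would be applied to the tokens here.
                StringBuilder joined = new StringBuilder();
                for (String token : tokens) {
                    if (joined.length() > 0) {
                        joined.append(' ');
                    }
                    joined.append(token);
                }
                Parse[] parses = ParserTool.parseLine(joined.toString(), parser, 1);
                parses[0].show(); // prints the bracketed parse tree
            }

            sentIn.close();
            tokIn.close();
            parseIn.close();
        }
    }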
>>>>>
>>>> Training the models with the double quotes instead of the single
>>>> forward/backward quotes would do the trick. That would explain why the
>>>> tokenizer model doesn't do well with my sentences...
>>>> I've had to train my own models for a lot of the stuff I'm doing these
>>>> days.
>>>>
>>>> Thanks,
>>>> James
>>>>
>>
>>
>
