Lucene has solved this problem with more sophisticated tools than
OpenNLP has, and with custom support for many languages.
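For example, quote normalization along these lines can be done with Lucene's MappingCharFilter (a minimal sketch, assuming Lucene 4.x; the mappings shown are illustrative, not exhaustive):

import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;

public class LuceneNormalizationSketch {

    // Rewrites curly quote characters to their ASCII equivalents
    // before the text reaches the analyzer chain.
    public static Reader normalizingReader(String text) {
        NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
        builder.add("\u201C", "\""); // left double quotation mark
        builder.add("\u201D", "\""); // right double quotation mark
        builder.add("\u2019", "'");  // right single quotation mark
        return new MappingCharFilter(builder.build(), new StringReader(text));
    }
}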
On 04/02/2013 01:37 AM, Jörn Kottmann wrote:
This seems to be a common problem with input text to our models. I
think we should add a normalization tool which a user can run
beforehand, or which we maybe even integrate into our models. It's
probably better for the user if it is integrated into the model,
because he usually doesn't know the specifics of the corpus the model
was trained on.
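A minimal sketch of what such a tool could look like (the character mappings here are illustrative assumptions, not a fixed list):

import java.util.HashMap;
import java.util.Map;

public class QuoteNormalizer {

    // Map common Unicode quote variants onto the ASCII forms the
    // pre-built models were trained on.
    private static final Map<Character, Character> QUOTE_MAP =
            new HashMap<Character, Character>();

    static {
        QUOTE_MAP.put('\u201C', '"');  // left double quotation mark
        QUOTE_MAP.put('\u201D', '"');  // right double quotation mark
        QUOTE_MAP.put('\u2018', '\''); // left single quotation mark
        QUOTE_MAP.put('\u2019', '\''); // right single quotation mark
    }

    public static String normalize(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            Character mapped = QUOTE_MAP.get(c);
            sb.append(mapped != null ? mapped.charValue() : c);
        }
        return sb.toString();
    }
}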
Any thoughts? Should we open a jira for it?
Jörn
On 03/29/2013 03:22 AM, William Colen wrote:
In my opinion the tokenizer is working properly and the issue is with
the quotes, which are unknown to the parser model. I would preprocess
the tokenized text, replacing the quotes with the ones known by the
model, which follow the Treebank convention.
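A minimal sketch of that post-tokenization replacement (assuming straight double quotes in the tokenized input and simple alternating open/close pairing):

public class TreebankQuotes {

    // Replace straight double-quote tokens with the Treebank forms
    // the parser model expects: `` for opening, '' for closing.
    public static String[] toTreebankQuotes(String[] tokens) {
        boolean open = true;
        for (int i = 0; i < tokens.length; i++) {
            if ("\"".equals(tokens[i])) {
                tokens[i] = open ? "``" : "''";
                open = !open;
            }
        }
        return tokens;
    }
}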
On Thu, Mar 28, 2013 at 11:13 PM, James Kosin <[email protected]>
wrote:
On 3/28/2013 9:54 AM, Ian Jackson wrote:
I used the prebuilt models for the SentenceModel (en-sent.bin),
TokenizerModel (en-token.bin), and ParserModel (en-parser-chunker.bin)
with the following sentence:
The "quick" brown fox jumps in over the lazy dog.
The result tags the opening quote as JJ and the closing quote as NN,
as follows:
(TOP (NP (NP (DT The) (JJ ") (JJ quick) (NN ") (JJ brown) (NN fox) (NNS jumps)) (PP (IN over) (NP (DT the) (JJ lazy) (NN dog))) (. .)))
If I alter the sentence as follows, changing the double quotes to two
single backward quotes (``) for the opening and two single forward
quotes ('') for the closing, per the Treebank tokenization convention
[http://www.cis.upenn.edu/~treebank/tokenization.html]:
The `` quick '' brown fox jumps over the lazy dog
The results are as follows:
(TOP (NP (NP (DT The) (`` ``) (JJ quick) ('' '') (JJ brown) (NN fox) (NNS jumps)) (PP (IN over) (NP (DT the) (JJ lazy) (NN dog))) (. .)))
Does a method exist to configure the tokenizer to handle quotes
within a sentence?
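For reference, a minimal sketch of wiring the prebuilt models together as described above (assuming the opennlp-tools 1.5 API; the model path is a placeholder, and ParserTool.parseLine expects a whitespace-tokenized sentence):

import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.cmdline.parser.ParserTool;
import opennlp.tools.parser.Parse;
import opennlp.tools.parser.Parser;
import opennlp.tools.parser.ParserFactory;
import opennlp.tools.parser.ParserModel;

public class ParseExample {

    public static void main(String[] args) throws Exception {
        InputStream modelIn = new FileInputStream("en-parser-chunker.bin");
        ParserModel model = new ParserModel(modelIn);
        modelIn.close();
        Parser parser = ParserFactory.create(model);

        // Quotes are passed as separate, whitespace-delimited tokens.
        Parse[] topParses = ParserTool.parseLine(
                "The \" quick \" brown fox jumps over the lazy dog .",
                parser, 1);
        topParses[0].show(); // prints the bracketed parse to stdout
    }
}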
Training the models with the double quotes instead of the single
forward/backward quotes would do the trick.
That would explain why the tokenizer model doesn't do well with my
sentences... I've had to train my own models for a lot of the stuff
I'm doing these days.
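For what it's worth, a minimal sketch of training a custom tokenizer model (assuming the 1.5.x training API, which may differ in other versions; en-token.train is a placeholder file in the <SPLIT>-annotated tokenizer training format):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;

import opennlp.tools.tokenize.TokenSample;
import opennlp.tools.tokenize.TokenSampleStream;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

public class TrainTokenizerSketch {

    public static void main(String[] args) throws Exception {
        // One sentence per line, token boundaries marked with <SPLIT>.
        ObjectStream<String> lines = new PlainTextByLineStream(
                new FileInputStream("en-token.train"), "UTF-8");
        ObjectStream<TokenSample> samples = new TokenSampleStream(lines);

        TokenizerModel model = TokenizerME.train("en", samples, true);

        OutputStream out = new FileOutputStream("en-token-custom.bin");
        model.serialize(out);
        out.close();
    }
}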
Thanks,
James