Re: Handling of Quotes

Jörn Kottmann Tue, 02 Apr 2013 01:37:58 -0700

This seems to be a common problem with input text to our models. I think we should add a normalization tool which a user can run beforehand, or maybe even integrate it into our models. It's probably better for the user if it is integrated into the model, because he usually doesn't know the specifics of the corpus the model was trained on.


Any thoughts? Should we open a jira for it?


Jörn

On 03/29/2013 03:22 AM, William Colen wrote:

In my opinion the tokenizer is working properly and the issue is with the
quotes, which are unknown to the parser model. I would preprocess the
tokenized text, replacing the quotes with the ones known to the model, which
follows the treebank convention.
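
A minimal sketch of that preprocessing step, assuming the text has already been split into tokens and that straight double quotes simply alternate between opening and closing within a sentence. The class name QuoteNormalizer and its normalize method are hypothetical and not part of OpenNLP; nested or unbalanced quotes are not handled.

```java
final class QuoteNormalizer {

    // Hypothetical helper, not an OpenNLP API: replaces each straight
    // double-quote token with the treebank-style opening (``) or closing ('')
    // quote token, assuming quotes alternate open/close within the sentence.
    static String[] normalize(String[] tokens) {
        String[] result = new String[tokens.length];
        boolean open = true; // the next straight quote is treated as an opening quote
        for (int i = 0; i < tokens.length; i++) {
            if ("\"".equals(tokens[i])) {
                result[i] = open ? "``" : "''";
                open = !open;
            } else {
                result[i] = tokens[i];
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] tokens = {"The", "\"", "quick", "\"", "brown", "fox",
                           "jumps", "over", "the", "lazy", "dog", "."};
        // Prints: The `` quick '' brown fox jumps over the lazy dog .
        System.out.println(String.join(" ", normalize(tokens)));
    }
}
```

The normalized tokens can then be joined with spaces and handed to the parser, so the input matches the quote tokens the model saw during training.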


On Thu, Mar 28, 2013 at 11:13 PM, James Kosin <[email protected]> wrote:

On 3/28/2013 9:54 AM, Ian Jackson wrote:

I used the prebuilt models for the SentenceModel (en-sent.bin),
TokenizerModel (en-token.bin), and ParserModel (en-parser-chunker.bin) with
the following sentence:
     The "quick" brown fox jumps over the lazy dog.

The result marks the part of speech for the quotes as JJ (for the open)
and NN (for the close) as follows:
(TOP (NP (NP (DT The) (JJ ") (JJ quick) (NN ") (JJ brown) (NN fox) (NNS
jumps)) (PP (IN over) (NP (DT the) (JJ lazy) (NN dog))) (. .)))

If I alter the sentence as follows, changing the double quotes to two
backquotes (opening) and two single quotes (closing), per the treebank
tokenization convention
[http://www.cis.upenn.edu/~treebank/tokenization.html]:
     The `` quick '' brown fox jumps over the lazy dog

The results are as follows:
(TOP (NP (NP (DT The) (`` ``) (JJ quick) ('' '') (JJ brown) (NN fox) (NNS
jumps)) (PP (IN over) (NP (DT the) (JJ lazy) (NN dog))) (. .)))

Does a method exist to configure the tokenizer to handle quotes within
a sentence?

Training the models with the double quotes instead of the single
forward/backward quotes would do the trick.
That would explain why the tokenizer model doesn't do well with my sentences...
I've had to train my own models for a lot of the stuff I'm doing these
days.

Thanks,
James
