This seems to be a common problem with input text to our models. I think
we should add a normalization tool that a user can run beforehand, or
maybe even integrate it into our models. It's probably better for the user
if it is integrated into the model, because the user usually doesn't know
the specifics of the corpus the model was trained on.
Any thoughts? Should we open a Jira issue for it?
Jörn
On 03/29/2013 03:22 AM, William Colen wrote:
In my opinion the tokenizer is working properly and the issue is with the
quotes, which are unknown to the parser model. I would preprocess the
tokenized text, replacing the quotes with the ones known to the model, which
follow the Treebank convention.
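
A minimal sketch of that preprocessing step, assuming the text has already been tokenized so that each quote is its own token. The class and method names (QuoteNormalizer, normalizeQuotes) are hypothetical and not part of OpenNLP, and the sketch assumes straight double quotes strictly alternate between opening and closing within the token array:

```java
public class QuoteNormalizer {

    // Hypothetical helper, not part of OpenNLP: rewrites standalone quote
    // tokens to the Penn Treebank convention (`` opens, '' closes).
    public static String[] normalizeQuotes(String[] tokens) {
        String[] result = new String[tokens.length];
        boolean insideQuote = false;
        for (int i = 0; i < tokens.length; i++) {
            String token = tokens[i];
            if ("\"".equals(token)) {
                // Assumption: straight quotes alternate open/close.
                result[i] = insideQuote ? "''" : "``";
                insideQuote = !insideQuote;
            } else if ("\u201C".equals(token)) {
                // Left curly double quote always opens.
                result[i] = "``";
                insideQuote = true;
            } else if ("\u201D".equals(token)) {
                // Right curly double quote always closes.
                result[i] = "''";
                insideQuote = false;
            } else {
                result[i] = token;
            }
        }
        return result;
    }
}
```

The normalized array would then be joined with spaces and handed to the parser in place of the raw tokenizer output. Nested or unbalanced quotes are not handled by this sketch.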
On Thu, Mar 28, 2013 at 11:13 PM, James Kosin <[email protected]> wrote:
On 3/28/2013 9:54 AM, Ian Jackson wrote:
I used the prebuilt models for the SentenceModel (en-sent.bin),
TokenizerModel (en-token.bin), and ParserModel (en-parser-chunker.bin) with
the following sentence:
The "quick" brown fox jumps over the lazy dog.
The result marks the part of speech for the quotes as JJ (for the open)
and NN (for the close) as follows:
(TOP (NP (NP (DT The) (JJ ") (JJ quick) (NN ") (JJ brown) (NN fox) (NNS
jumps)) (PP (IN over) (NP (DT the) (JJ lazy) (NN dog))) (. .)))
If I alter the sentence as follows, changing the double quotes to pairs of
single backward and forward quotes
[http://www.cis.upenn.edu/~treebank/tokenization.html]:
The `` quick '' brown fox jumps over the lazy dog
The results are as follows:
(TOP (NP (NP (DT The) (`` ``) (JJ quick) ('' '') (JJ brown) (NN fox) (NNS
jumps)) (PP (IN over) (NP (DT the) (JJ lazy) (NN dog))) (. .)))
Does a method exist to configure the tokenizer to handle quotes within
a sentence?
Training the models with the double quotes instead of the single
forward/backward quote would do the trick.
That would explain why the tokenizer model doesn't do well with my sentences...
I've had to train my own models for a lot of the stuff I'm doing these
days.
Thanks,
James