In my opinion the tokenizer is working properly; the issue is the quotes, which are unknown to the parser model. I would preprocess the tokenized text, replacing each quote with the form the model knows, which follows the Treebank convention (`` for an opening quote and '' for a closing quote).
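
A minimal sketch of that preprocessing step, in plain Java. TreebankQuotes and normalize are hypothetical names, not part of OpenNLP, and the sketch assumes that straight double quotes strictly alternate between opening and closing within a sentence, which will not hold for unbalanced or nested quotes.

```java
import java.util.Arrays;

/**
 * Hypothetical helper (not part of OpenNLP): rewrites quote tokens into the
 * Penn Treebank forands `` (opening) and '' (closing) before parsing.
 */
final class TreebankQuotes {

    private TreebankQuotes() {
    }

    /**
     * Returns a copy of the token array with quote tokens replaced.
     * Assumption: straight double quotes alternate open/close in order.
     * Tokens are assumed to be non-null.
     */
    static String[] normalize(String[] tokens) {
        String[] out = Arrays.copyOf(tokens, tokens.length);
        boolean insideQuote = false;
        for (int i = 0; i < out.length; i++) {
            switch (out[i]) {
                case "\"":
                    // Straight quote: opening if we are not inside a quote yet.
                    out[i] = insideQuote ? "''" : "``";
                    insideQuote = !insideQuote;
                    break;
                case "\u201C":
                    // Left curly quote is always an opening quote.
                    out[i] = "``";
                    insideQuote = true;
                    break;
                case "\u201D":
                    // Right curly quote is always a closing quote.
                    out[i] = "''";
                    insideQuote = false;
                    break;
                default:
                    // Every other token is left untouched.
                    break;
            }
        }
        return out;
    }
}
```

The idea would be to call normalize on the String[] returned by the tokenizer, then join the tokens with single spaces to build the line handed to the parser. Only whole tokens are replaced, so this relies on the tokenizer having already split the quotes off from the neighbouring words.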

On Thu, Mar 28, 2013 at 11:13 PM, James Kosin <[email protected]> wrote:

> On 3/28/2013 9:54 AM, Ian Jackson wrote:
>
>> I used the prebuilt models for the SetenceModel (en-sent.bin),
>> TokenizerModel (en-token.bin), and ParserModel (en-parser-chunker.bin) with
>> the following sentence:
>> The "quick" brown fox jumps in over the lazy dog.
>>
>> The result marks the part of speech for the quotes as JJ (for the open)
>> and (NN for the close) as follows:
>> (TOP (NP (NP (DT The) (JJ ") (JJ quick) (NN ") (JJ brown) (NN fox) (NNS
>> jumps)) (PP (IN over) (NP (DT the) (JJ lazy) (NN dog))) (. .)))
>>
>> If I alter the sentence as follows changing double quotes to two single
>> forward quotes and backward quotes
>> [http://www.cis.upenn.edu/~treebank/tokenization.html]:
>> The `` quick '' brown fox jumps over the lazy dog
>>
>> The results are as follows:
>> (TOP (NP (NP (DT The) (`` ``) (JJ quick) ('' '') (JJ brown) (NN fox) (NNS
>> jumps)) (PP (IN over) (NP (DT the) (JJ lazy) (NN dog))) (. .)))
>>
>> Does a method exists to configure the tokenizer to handled quotes within
>> a sentence?
>>
> Training the models with the double quotes instead of the single
> forward/backward quote would do the trick.
> Would explain why the tokenizer model doesn't do good with my sentences...
> I've had to train my own models for a lot of the stuff I'm doing these
> days.
>
> Thanks,
> James
>
