On 3/28/2013 9:54 AM, Ian Jackson wrote:
Training the models with double quotes instead of the single forward/backward quotes would do the trick. That would explain why the tokenizer model doesn't do well with my sentences. I've had to train my own models for a lot of the stuff I'm doing these days.

I used the prebuilt models for the SentenceModel (en-sent.bin), TokenizerModel (en-token.bin), and ParserModel (en-parser-chunker.bin) with the following sentence:

The "quick" brown fox jumps over the lazy dog.

The result marks the part of speech for the quotes as JJ (for the open) and NN (for the close):

(TOP (NP (NP (DT The) (JJ ") (JJ quick) (NN ") (JJ brown) (NN fox) (NNS jumps)) (PP (IN over) (NP (DT the) (JJ lazy) (NN dog))) (. .)))

If I alter the sentence, changing the double quotes to two backquotes for the opening and two single forward quotes for the closing, per the Penn Treebank tokenization guidelines [http://www.cis.upenn.edu/~treebank/tokenization.html]:

The `` quick '' brown fox jumps over the lazy dog

The results are as follows:

(TOP (NP (NP (DT The) (`` ``) (JJ quick) ('' '') (JJ brown) (NN fox) (NNS jumps)) (PP (IN over) (NP (DT the) (JJ lazy) (NN dog))) (. .)))

Does a method exist to configure the tokenizer to handle quotes within a sentence?
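One possible workaround, rather than retraining the models: normalize straight double quotes to the Penn Treebank `` and '' forms before handing the text to the tokenizer. This is only a sketch of that idea, not an OpenNLP API; the QuoteNormalizer class and its alternating open/close heuristic are my own assumption and would misfire on unbalanced quotes.

```java
public class QuoteNormalizer {

    // Replace straight double quotes with Penn Treebank-style quotes,
    // alternating between opening (``) and closing ('') on each occurrence.
    // Assumes quotes in the input are balanced.
    public static String normalize(String text) {
        StringBuilder sb = new StringBuilder();
        boolean opening = true;
        for (char c : text.toCharArray()) {
            if (c == '"') {
                sb.append(opening ? " `` " : " '' ");
                opening = !opening;
            } else {
                sb.append(c);
            }
        }
        // Collapse the extra spaces introduced around the quote tokens.
        return sb.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(normalize("The \"quick\" brown fox jumps over the lazy dog."));
        // prints: The `` quick '' brown fox jumps over the lazy dog.
    }
}
```

The normalized string can then be passed to the existing en-token.bin tokenizer, which already handles the `` and '' forms as the second parse above shows.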
Thanks, James
