Lucene has solved this problem with more sophisticated tools than
OpenNLP has, and with custom support for many languages.
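For example, quote normalization along these lines can be done with Lucene's MappingCharFilter (a minimal sketch, assuming Lucene 4.x; the mappings shown are illustrative, not exhaustive):

import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;

public class LuceneNormalizationSketch {

    // Rewrites curly quote characters to their ASCII equivalents
    // before the text reaches the analyzer chain.
    public static Reader normalizingReader(String text) {
        NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
        builder.add("\u201C", "\""); // left double quotation mark
        builder.add("\u201D", "\""); // right double quotation mark
        builder.add("\u2019", "'");  // right single quotation mark
        return new MappingCharFilter(builder.build(), new StringReader(text));
    }
}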
On 04/02/2013 01:37 AM, Jörn Kottmann wrote:
This seems to be a common problem with input text to our models. I
think we should add a normalization tool which a user can run
beforehand, or which we maybe even integrate into our models. It's
probably better for the user if it is integrated into the model,
because he usually doesn't know the specifics of the corpus the model
was trained on.
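A minimal sketch of what such a tool could look like (the character mappings here are illustrative assumptions, not a fixed list):

import java.util.HashMap;
import java.util.Map;

public class QuoteNormalizer {

    // Map common Unicode quote variants onto the ASCII forms the
    // pre-built models were trained on.
    private static final Map<Character, Character> QUOTE_MAP =
            new HashMap<Character, Character>();

    static {
        QUOTE_MAP.put('\u201C', '"');  // left double quotation mark
        QUOTE_MAP.put('\u201D', '"');  // right double quotation mark
        QUOTE_MAP.put('\u2018', '\''); // left single quotation mark
        QUOTE_MAP.put('\u2019', '\''); // right single quotation mark
    }

    public static String normalize(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            Character mapped = QUOTE_MAP.get(c);
            sb.append(mapped != null ? mapped.charValue() : c);
        }
        return sb.toString();
    }
}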
Any thoughts? Should we open a jira for it?
Jörn
On 03/29/2013 03:22 AM, William Colen wrote:
In my opinion the tokenizer is working properly and the issue is with
the quotes, which are unknown to the parser model. I would preprocess
the tokenized text, replacing the quotes with the ones known by the
model, which follow the Treebank convention.
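A minimal sketch of that post-tokenization replacement (assuming straight double quotes in the tokenized input and simple alternating open/close pairing):

public class TreebankQuotes {

    // Replace straight double-quote tokens with the Treebank forms
    // the parser model expects: `` for opening, '' for closing.
    public static String[] toTreebankQuotes(String[] tokens) {
        boolean open = true;
        for (int i = 0; i < tokens.length; i++) {
            if ("\"".equals(tokens[i])) {
                tokens[i] = open ? "``" : "''";
                open = !open;
            }
        }
        return tokens;
    }
}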
On Thu, Mar 28, 2013 at 11:13 PM, James Kosin <[email protected]>
wrote:
On 3/28/2013 9:54 AM, Ian Jackson wrote:
I used the prebuilt models for the SentenceModel (en-sent.bin),
TokenizerModel (en-token.bin), and ParserModel (en-parser-chunker.bin)
with the following sentence:
The "quick" brown fox jumps in over the lazy dog.
The result tags the opening quote as JJ and the closing quote as NN,
as follows:
(TOP (NP (NP (DT The) (JJ ") (JJ quick) (NN ") (JJ brown) (NN fox) (NNS jumps)) (PP (IN over) (NP (DT the) (JJ lazy) (NN dog))) (. .)))
If I alter the sentence as follows, changing the double quotes to two
single backward quotes (``) for the opening and two single forward
quotes ('') for the closing, per the Treebank tokenization convention
[http://www.cis.upenn.edu/~treebank/tokenization.html]:
The `` quick '' brown fox jumps over the lazy dog
The results are as follows:
(TOP (NP (NP (DT The) (`` ``) (JJ quick) ('' '') (JJ brown) (NN fox) (NNS jumps)) (PP (IN over) (NP (DT the) (JJ lazy) (NN dog))) (. .)))
Does a method exist to configure the tokenizer to handle quotes
within a sentence?
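For reference, a minimal sketch of wiring the prebuilt models together as described above (assuming the opennlp-tools 1.5 API; the model path is a placeholder, and ParserTool.parseLine expects a whitespace-tokenized sentence):

import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.cmdline.parser.ParserTool;
import opennlp.tools.parser.Parse;
import opennlp.tools.parser.Parser;
import opennlp.tools.parser.ParserFactory;
import opennlp.tools.parser.ParserModel;

public class ParseExample {

    public static void main(String[] args) throws Exception {
        InputStream modelIn = new FileInputStream("en-parser-chunker.bin");
        ParserModel model = new ParserModel(modelIn);
        modelIn.close();
        Parser parser = ParserFactory.create(model);

        // Quotes are passed as separate, whitespace-delimited tokens.
        Parse[] topParses = ParserTool.parseLine(
                "The \" quick \" brown fox jumps over the lazy dog .",
                parser, 1);
        topParses[0].show(); // prints the bracketed parse to stdout
    }
}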
Training the models with the double quotes instead of the single
forward/backward quotes would do the trick.
That would explain why the tokenizer model doesn't do well with my
sentences... I've had to train my own models for a lot of the stuff
I'm doing these days.
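For what it's worth, a minimal sketch of training a custom tokenizer model (assuming the 1.5.x training API, which may differ in other versions; en-token.train is a placeholder file in the <SPLIT>-annotated tokenizer training format):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;

import opennlp.tools.tokenize.TokenSample;
import opennlp.tools.tokenize.TokenSampleStream;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

public class TrainTokenizerSketch {

    public static void main(String[] args) throws Exception {
        // One sentence per line, token boundaries marked with <SPLIT>.
        ObjectStream<String> lines = new PlainTextByLineStream(
                new FileInputStream("en-token.train"), "UTF-8");
        ObjectStream<TokenSample> samples = new TokenSampleStream(lines);

        TokenizerModel model = TokenizerME.train("en", samples, true);

        OutputStream out = new FileOutputStream("en-token-custom.bin");
        model.serialize(out);
        out.close();
    }
}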
Thanks,
James