On 04/15/2013 02:31 AM, Richard Head Jr. wrote:
I have a bunch of sentences like the following:
Guacamole Dip: 5 Hass Avocados, Jalapeno Puree with Salt and BHT (preservative).
They are standalone, i.e., they are not contained within a larger
paragraph/document structure.
I want to tag various words, creating the following:
Guacamole Dip: 5 Hass <START:term>Avocados<END>, <START:term>Jalapeno<END> Puree with
<START:term>Salt<END> and <START:term>BHT<END> (preservative).
Looking through the mailing list for guidance, I came across this:
http://mail-archives.apache.org/mod_mbox/opennlp-users/201205.mbox/%3C4FA1EE7E.2080608%40gmail.com%3E
Which made me think that, before going through 100 or so documents and tagging
the words to create training data, I should get some clarification on the
following:
1. Is NER the right tool for this?
2. My training data is somewhat small (~100 sentences). Will this stymie my goal
above?
3. Were the poor results the gentleman had with Italian addresses in part due to
a bug mentioned here:
http://mail-archives.apache.org/mod_mbox/opennlp-users/201205.mbox/%3C4FA1EF10.2020904%40gmail.com%3E
4. Is it possible to use a text file containing only terms, or a tab-delimited
file like the ones the Stanford NER uses?
Yes, the NER should be capable of detecting the terms, but you could
also try to use a dictionary.
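A minimal sketch of the dictionary idea, in plain Python rather than the OpenNLP API. The term list and the find_terms helper are made up for illustration; the function just reports which whitespace-separated tokens match a known term, ignoring case and surrounding punctuation.

```python
import string

# Hypothetical term list; in practice this would be loaded from a file.
TERMS = {"avocados", "jalapeno", "salt", "bht"}

def find_terms(sentence, terms=TERMS):
    """Return (token_index, token) pairs for tokens found in the term list."""
    hits = []
    for i, token in enumerate(sentence.split()):
        # Strip punctuation such as "," or "(" before the lookup.
        bare = token.strip(string.punctuation)
        if bare.lower() in terms:
            hits.append((i, bare))
    return hits

sentence = ("Guacamole Dip: 5 Hass Avocados, Jalapeno Puree "
            "with Salt and BHT (preservative).")
print(find_terms(sentence))
```

A lookup like this only finds single-word terms that are already in the list, which is the main trade-off against a trained model.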
Your training data is too small, especially if you train the maxent model
with a cutoff of 5; the perceptron will work better. Label more data until
you have a few thousand sentences.
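To illustrate why the cutoff matters with so little data: roughly speaking, the cutoff discards features seen fewer than that many times during training. The sketch below is plain Python, not OpenNLP code. It uses lowercased tokens as a stand-in for the real features, and the three sentences are invented.

```python
from collections import Counter

def surviving_features(sentences, cutoff):
    """Return the token 'features' seen at least `cutoff` times."""
    counts = Counter(tok.lower() for s in sentences for tok in s.split())
    return {tok for tok, n in counts.items() if n >= cutoff}

# Invented toy corpus standing in for a small training set.
corpus = [
    "Hass Avocados with Salt",
    "Tomato Puree with Salt and BHT",
    "Jalapeno Puree with Lime",
]

for cutoff in (1, 2, 5):
    kept = surviving_features(corpus, cutoff)
    print(cutoff, len(kept), sorted(kept))
```

With a tiny corpus almost nothing is seen five times, so nearly every feature is thrown away before training starts.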
The bug you mention was fixed in 1.5.3, and it only occurred in multi-type
models anyway.
You need complete sentences to train the NER model; a list of terms alone
does not work. And no, we do not support the Stanford format.
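A term list can still help you pre-annotate complete sentences before checking them by hand. The sketch below is plain Python with a made-up term list and a crude regex tokenizer, so treat it as a starting point only. It assumes the training format is one tokenized sentence per line, with the <START:term> and <END> markers as separate whitespace-delimited tokens.

```python
import re

# Hypothetical term list, used only to pre-annotate; review the output by hand.
TERMS = {"avocados", "jalapeno", "salt", "bht"}

def annotate(sentence, terms=TERMS, label="term"):
    """Tokenize a full sentence and wrap known terms in START/END markers."""
    # Crude tokenizer: runs of word characters, or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", sentence)
    out = []
    for tok in tokens:
        if tok.lower() in terms:
            out.append(f"<START:{label}> {tok} <END>")
        else:
            out.append(tok)
    return " ".join(out)

print(annotate("Guacamole Dip: 5 Hass Avocados, Jalapeno Puree "
               "with Salt and BHT (preservative)."))
```

The surrounding words ("Hass", "Puree with", and so on) stay in the output, which is the context the model needs and a bare term list cannot provide.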
Jörn