Hi Dan,
The dictionary element is meant to be added to the name recognizer, either
to help it find names it would otherwise miss or to help enforce name
recognition. I'm not exactly sure this is quite what you want to do.
There is a lesser-used DictionaryNameFinder that may be better suited to
what you want to do... I think. But the current version in 1.5.2 has a
few bugs. You can get a pre-release of our next release here:
http://people.apache.org/~colen/releases/opennlp-1.5.3/rc2/
which should help with the problems.
The dictionary format is fairly straightforward, though not well
documented. There are also several CLI tools to convert files to the
dictionary format.
I guess I'll try to improve the documentation here.... :-)
<?xml version="1.0" encoding="UTF-8"?>
<dictionary case_sensitive="true">
<entry>
<token>Patrick</token>
</entry>
</dictionary>
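To make the structure of that file concrete, here is a minimal sketch that reads the entries with nothing but the JDK's XML parser. It is only an illustration of the format, not how OpenNLP loads it; in OpenNLP itself, as far as I know, you would simply pass an InputStream to the Dictionary constructor. The class and method names below are hypothetical, and the second entry with two tokens is my assumption about how a multi-token term is written.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, NOT part of OpenNLP: reads <entry>/<token> elements
// from a dictionary XML string using only the JDK.
class DictionaryXmlSketch {

    // Returns one list of tokens per <entry> element.
    static List<List<String>> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList entryNodes = doc.getElementsByTagName("entry");
            List<List<String>> entries = new ArrayList<>();
            for (int i = 0; i < entryNodes.getLength(); i++) {
                Element entry = (Element) entryNodes.item(i);
                NodeList tokenNodes = entry.getElementsByTagName("token");
                List<String> tokens = new ArrayList<>();
                for (int j = 0; j < tokenNodes.getLength(); j++) {
                    tokens.add(tokenNodes.item(j).getTextContent());
                }
                entries.add(tokens);
            }
            return entries;
        } catch (Exception e) {
            // Malformed XML or parser setup problems end up here.
            throw new IllegalStateException("Could not read dictionary XML", e);
        }
    }

    public static void main(String[] args) {
        // Assumed example: one single-token entry and one two-token entry.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<dictionary case_sensitive=\"true\">"
                + "<entry><token>Patrick</token></entry>"
                + "<entry><token>Patrick</token><token>Henry</token></entry>"
                + "</dictionary>";
        for (List<String> entry : parse(xml)) {
            System.out.println(entry);
        }
    }
}
```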
Each entry in the dictionary lists the tokens that make up one term. When the
DictionaryNameFinder is called, it will attempt to find the longest
matching series from the dictionary in the document.
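To show what "longest matching series" means in practice, here is a rough plain-Java sketch of that kind of lookup over already-tokenized text. It is not the DictionaryNameFinder code, and the real class may differ in details such as case handling or overlapping matches; every name in it is hypothetical. With OpenNLP on the classpath you would, as far as I know, instead build a DictionaryNameFinder from a Dictionary and call find() on the token array, which returns Span objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java illustration of longest-match dictionary lookup.
// NOT OpenNLP's DictionaryNameFinder; all names here are hypothetical.
class LongestMatchSketch {

    // Returns {start, end} token offsets (end exclusive) for each match.
    // Matching is case sensitive and matches never overlap.
    static List<int[]> find(Set<List<String>> entries, String[] tokens) {
        // The longest entry bounds how far ahead we ever need to look.
        int maxLen = 0;
        for (List<String> entry : entries) {
            maxLen = Math.max(maxLen, entry.size());
        }

        List<String> tokenList = Arrays.asList(tokens);
        List<int[]> spans = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            int matched = 0;
            // Try the longest candidate first, then shrink it.
            for (int len = Math.min(maxLen, tokens.length - i); len > 0; len--) {
                if (entries.contains(tokenList.subList(i, i + len))) {
                    matched = len;
                    break;
                }
            }
            if (matched > 0) {
                spans.add(new int[] {i, i + matched});
                i += matched; // continue after the matched tokens
            } else {
                i++;
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        Set<List<String>> entries = new HashSet<>();
        entries.add(Arrays.asList("Patrick"));
        entries.add(Arrays.asList("Patrick", "Henry"));

        String[] tokens = {"I", "met", "Patrick", "Henry", "and", "Patrick", "."};
        for (int[] span : find(entries, tokens)) {
            String text = String.join(" ", Arrays.copyOfRange(tokens, span[0], span[1]));
            System.out.println(span[0] + ".." + span[1] + " " + text);
        }
    }
}
```

In the sample above, "Patrick Henry" is reported as a single two-token match rather than as "Patrick" alone, because the longer entry is tried first.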
This sort of dictionary is best for keywords, some names and special
words. You could populate this type of dictionary with the keywords
for C/C++ and use it to parse a program file and tag all of the
keywords.
Let me know if I'm headed down the wrong path here....
Thanks,
James
On 3/8/2013 11:56 PM, Daniel Franc wrote:
Hi James,
Thanks for your reply. Maybe my questions are too elementary, so sorry!
I was running through the OpenNLP manual and went through the
"tokenizer" step
(http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.tokenizer).
Then when running through the "name finder" step it alluded to an
alternative separate dictionary lookup step (end of this section:
http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.namefind.recognition.api)
I was able to create a dictionary for lookup, but I can't figure out
how to load it up or search with it.
My eventual goal is to have a method to look up a set of terms within a
document as an alternative way to classify or tag the document and not
necessarily use the statistical name finder. I'm not familiar with
JWNL but I could give that a try. It seems that I could manually code
a text search through a document, but I thought I'd try to use OpenNLP
first.
Thanks again -- Dan
On Fri, Mar 8, 2013 at 4:22 PM, James Kosin wrote:
Dan,
I'm guessing when you say tokenized you mean with POS values. If
so, a better approach would be to use the JWNL library to look up
the dictionary terms. We use this with our coref component and it
isn't hard to get working. The biggest thing with POS is
selecting the right one. It may be better to build a model for
the POS tagger than to build a dictionary for this. Unless you
mean this for a different language.
I guess I need more information from you on what you are trying to
accomplish?
James
On 3/8/2013 6:05 PM, Daniel Franc wrote:
Hello friends,
I am at a novice level for both OpenNLP and Java and have been fumbling
around to put together a working version of the software with some success
thanks to the documentation provided! My eventual goal is partially to
look up terms within a pre-defined dictionary, and I've been able to use
the dictionary creator to create a basic dictionary to look up from, as here:
dictionary.serialize(new FileOutputStream(
"/Applications/apache-opennlp-1.5.2-incubating/dictionarynames.txt"));
My particular questions are:
1. Can someone help me with loading this dictionary after it was previously
created?
2. Is there a straightforward way to implement a basic lookup mechanism for
tokenized text?
Thanks for your help!
-Dan