On 01/10/2014 05:30 AM, Andrew Bagshaw wrote:
Hi! I’m new to OpenNLP and have been playing around with it in C# .NET
following the instructions at
https://cwiki.apache.org/confluence/display/OPENNLP/Introduction+to+using+openNLP+in+.NET+Projects.
So far everything has been going without a hitch, except I can’t figure out how
to use the training API in C# to train a Name Finder model. I’ve been tinkering
around without success and can’t seem to find any documentation on this.
I've finally managed to get some code that at least compiles, but it gives me
this error on the third-to-last line:
An unhandled exception of type 'java.lang.IllegalStateException' occurred in
opennlp.dll
Additional information: java.security.NoSuchAlgorithmException: class
configured for MessageDigest(provider: SUN)cannot be found.
During training, a hash of all the events is computed with the Java
API. It looks like retrieving the MessageDigest fails on .NET.
I am not sure how that can be fixed. Maybe we could use a different
approach to compute the hash.
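
To narrow this down, a small probe like the following could be compiled to a
.NET assembly the same way as opennlp.dll and run there. It is only a sketch:
the class name DigestProbe and the list of algorithms are illustrative, and
which algorithm the training code actually requests is an assumption here
(MD5 is a guess), so check the stack trace for the real one.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical diagnostic class, not part of OpenNLP.
class DigestProbe {

    // Returns true if the runtime can supply a MessageDigest for the algorithm.
    static boolean isAvailable(String algorithm) {
        try {
            MessageDigest.getInstance(algorithm);
            return true;
        } catch (NoSuchAlgorithmException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed candidates; the algorithm OpenNLP asks for may differ.
        for (String alg : new String[] {"MD5", "SHA-1", "SHA-256"}) {
            System.out.println(alg + ": "
                + (isAvailable(alg) ? "available" : "missing"));
        }
    }
}
```

If an algorithm shows up as missing on .NET but is available on a regular
JVM, the problem is in the converted class library's security providers
rather than in the training code itself.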
A pragmatic solution could be to export the training data to a
file and afterwards use the command line tool to do the training (using
the JVM).
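
As a rough sketch of the export step, the following writes samples in the
name finder training format (one whitespace-tokenized sentence per line, names
wrapped in <START:person> ... <END>). The class NameSampleExport, the helper
toTrainingLine and the example sentences are made up for illustration; if the
samples already sit in train.txt, as in the code below, only the command line
step is needed.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, not part of OpenNLP.
class NameSampleExport {

    // Wraps tokens[start, end) in <START:type> ... <END> and joins all
    // tokens with single spaces, giving one training line.
    static String toTrainingLine(String[] tokens, int start, int end, String type) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i == start) sb.append("<START:").append(type).append("> ");
            sb.append(tokens[i]);
            if (i == end - 1) sb.append(" <END>");
            if (i < tokens.length - 1) sb.append(' ');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Invented example sentences; real data comes from your annotations.
        List<String> lines = Arrays.asList(
            toTrainingLine(new String[] {"Mr.", "Smith", "arrived", "."}, 1, 2, "person"),
            toTrainingLine(new String[] {"Anna", "Lee", "left", "."}, 0, 2, "person"));
        // Write the file as UTF-8 so the encoding passed to the trainer matches.
        Files.write(Paths.get("train.txt"), lines, StandardCharsets.UTF_8);
    }
}
```

The training itself can then be run on the JVM with something like

    opennlp TokenNameFinderTrainer -model test.bin -lang en -data train.txt -encoding UTF-8

(the exact options may differ between OpenNLP versions), and the resulting
test.bin can be loaded from C# like any other model.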
My code:
FileReader fileReader = new FileReader("train.txt");
ObjectStream fileStream = new PlainTextByLineStream(fileReader);
ObjectStream sampleStream = new NameSampleDataStream(fileStream);
TokenNameFinderModel test;
test = NameFinderME.train("en", "person", sampleStream,
    Collections.emptyMap());
opennlp.tools.namefind.TokenNameFinderModel model = NameFinderME.train("en",
    "person", sampleStream, Collections.emptyMap()); // I get the error here
BufferedOutputStream modelOut = new BufferedOutputStream(
    new FileOutputStream("test.bin"));
model.serialize(modelOut);
I’m not sure if I’m even on the right track with this code. If someone would be
kind enough to set me on the right track I would be very grateful.
The code looks good.
I have been using the name finder algorithm with the person name finder model
with success, although I find that it misses a bunch of names that I would like
it to detect. Is there a way that I can add to the model (train it without
overwriting the current information in it)? That is what I am trying to
accomplish.
No, that is not possible in our current implementation.
To get a well-performing NER model you usually need to annotate your own
data. If you want to process English text, it might be worth training on
OntoNotes; OpenNLP has built-in support for it.
HTH,
Jörn