Hi--
I am using the opennlp framework for a sequence tagging task, and have
written the necessary code to handle input data in the following format:
<token1> [<feature1=value> <feature2=value> ...] <tag1>
<token2> [<feature1=value> <feature5=value> ...] <tag2>
....
....
<tokenX> [<feature10=value> ...] <tagX>
<empty-line-to-separate-sentences>
[more sentences follow]
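
As a rough illustration of the format above, one line could be split into its parts as follows. This is only a minimal sketch, not the code mentioned above: the class name TaggedLineSketch is hypothetical, and it assumes that fields are whitespace-separated, that the square brackets are notation for "optional" rather than literal characters, and that the last field is always the tag.

```java
import java.util.Arrays;

// Hypothetical helper, for illustration only.
final class TaggedLineSketch {
    final String token;
    final String[] features; // raw "feature=value" strings, possibly empty
    final String tag;

    TaggedLineSketch(String token, String[] features, String tag) {
        this.token = token;
        this.features = features;
        this.tag = tag;
    }

    /** Splits "token feature=value ... tag" on whitespace (assumed separator). */
    static TaggedLineSketch parse(String line) {
        String[] fields = line.trim().split("\\s+");
        if (fields.length < 2) {
            throw new IllegalArgumentException("Expected at least a token and a tag: " + line);
        }
        // Everything between the first and the last field is treated as a feature.
        String[] features = Arrays.copyOfRange(fields, 1, fields.length - 1);
        return new TaggedLineSketch(fields[0], features, fields[fields.length - 1]);
    }
}
```

An empty line would be handled by the caller as a sentence boundary before calling parse.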
The problem I am facing is as follows. opennlp.tools.util.BeamSearch
calls the following API on opennlp.maxent.GISModel in the bestSequences(...)
method:
public double[] eval(String[] context, double probs[]);
In this version of eval(...), the real values for the input features are
not used.
If bestSequences could instead use the following API,
public double[] eval(String[] context, float[] values);
it could correctly use the real-valued features parsed from the above format.
Currently, I have subclassed BeamSearch, and overridden the bestSequences
method, which parses the contexts using the RealValueFileEventStream class.
However, this derived class and its new bestSequences method are, for the
most part, a copy of the original, leading to unnecessary code duplication.
I am wondering if there is any possibility (and utility) of incorporating
this logic into the original BeamSearch code. In particular, would it be
acceptable to do something like this (ignoring caching logic for now):
float[] values = RealValueFileEventStream.parseContexts(contexts);
double[] scores = model.eval(contexts, values);
RealValueFileEventStream.parseContexts returns null if it cannot parse even
a single valid float value from the input contexts. In that case,
GISModel.eval(...) will ignore a null values array (using the default value
of 1 for every context).
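
To make the null fallback concrete, here is a minimal, self-contained sketch of that scoring step. Nothing in it is OpenNLP code: RealValuedModelSketch is a hypothetical interface that only mirrors the eval(String[], float[]) signature quoted above, the parser is passed in as a function, and TOY_MODEL is a made-up model whose single score is just the sum of the context values.

```java
import java.util.function.Function;

// Hypothetical interface mirroring the eval(String[], float[]) overload.
interface RealValuedModelSketch {
    double[] eval(String[] context, float[] values);
}

final class ScoringStepSketch {

    /** Parses values (possibly null) and hands them to the model in one step. */
    static double[] score(RealValuedModelSketch model,
                          Function<String[], float[]> valueParser,
                          String[] contexts) {
        float[] values = valueParser.apply(contexts); // null means "no real values found"
        return model.eval(contexts, values);
    }

    /** Toy model: one outcome, scored as the sum of values; null counts as 1 per context. */
    static final RealValuedModelSketch TOY_MODEL = (context, values) -> {
        double sum = 0;
        for (int i = 0; i < context.length; i++) {
            sum += (values == null) ? 1.0 : values[i];
        }
        return new double[] { sum };
    };
}
```

With a parser that returns null, every context contributes the default value of 1, so purely binary features would be scored as before.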
One problem with this is that it introduces an implicit convention that
context generators would have to know: any context in the format
"feature=value" will be considered for parsing as a real-valued feature. So
'=' becomes a special character in some sense. Note that it won't be a
problem if the string following '=' cannot be parsed as a float (the
feature just gets the default value of 1). If it is parsed as a valid
negative value, though, a RuntimeException will be thrown from
RealValueFileEventStream.parseContexts.
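
The convention described above can be summarised in a small stand-in. This is a sketch of the behaviour as described in this message, not the RealValueFileEventStream source: the class and method names are hypothetical, and splitting at the last '=' is an assumption.

```java
// Hypothetical stand-in for the parsing convention; not OpenNLP code.
final class ContextValueSketch {

    /** Returns one value per context, or null if no context held a parsable float. */
    static float[] parseValues(String[] contexts) {
        float[] values = new float[contexts.length];
        boolean sawRealValue = false;
        for (int i = 0; i < contexts.length; i++) {
            values[i] = 1f; // default for binary or unparsable features
            int eq = contexts[i].lastIndexOf('='); // assumption: split at the last '='
            if (eq <= 0 || eq + 1 >= contexts[i].length()) {
                continue; // not in "feature=value" shape
            }
            float parsed;
            try {
                parsed = Float.parseFloat(contexts[i].substring(eq + 1));
            } catch (NumberFormatException e) {
                continue; // e.g. "pos=NN": keep the default of 1
            }
            if (parsed < 0) {
                throw new RuntimeException("Negative value not allowed: " + contexts[i]);
            }
            values[i] = parsed;
            sawRealValue = true;
        }
        return sawRealValue ? values : null;
    }
}
```

Under these assumptions, a context such as "pos=NN" silently keeps the value 1, whereas "offset=-2" aborts with an exception, which is the asymmetry a context generator would need to be aware of.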
Another issue is the possible performance hit due to the parseContexts call.
Any thoughts on this issue are most welcome. Have other users of opennlp
encountered a similar situation, and how was it handled?
Thanks!
Mahesh