Hello OpenNLP community,
We are using the OpenNLP Name Finder to train models on a domain-specific
German dataset. Since upgrading from version 1.6.0 to 1.8.4, I have noticed
that the Name Finder model scores better but is no longer robust.
Trained on the small amount of data we have, the new version improves upon
the F-score that 1.6.0 achieved on our test set.
However, to augment the small amount of training data that I have, I have
generated some "synthetic" data. It is conceivable that this "unclean" data
would confuse the model, but in 1.6.0 it improved the F-score. This is no
longer the case in 1.8.4: any manipulation of the data appears to confuse the
model and causes it to produce many false positives.
I'd like to understand a little better what has changed between these two
versions, but the release notes aren't very descriptive. Has anybody else
seen similarly large changes in Name Finder behavior with the new version?
Many thanks in advance!