Hello,

this is probably due to the change of the default training algorithm from
maxent to perceptron. On many data sets the perceptron outperforms maxent,
so it was made the default for newly trained models.
Take a look at the lang/ml folder in the distribution; it contains a params
file for training with maxent instead, which can be passed via the -params
CLI argument.
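
For example (untested, and the file names here are only placeholders), a
maxent params file is a plain properties file read by TrainingParameters,
with entries along these lines:

    Algorithm=MAXENT
    Iterations=100
    Cutoff=5

and you would hand it to the name finder trainer on the command line,
roughly like this:

    bin/opennlp TokenNameFinderTrainer -lang de -params maxent.params \
        -data train.txt -encoding UTF-8 -model de-ner-custom.bin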

HTH,
Jörn

On Wed, Mar 7, 2018 at 3:34 PM, Fraser Bowen
<fraser.bo...@westernacher.com> wrote:
> Hello OpenNLP community,
>
> We are using the OpenNLP Name Finder to train models on a domain-specific 
> German dataset. However, since upgrading from version 1.6.0 to 1.8.4, I have 
> noticed that the Name Finder model performs much better, but is no longer robust.
>
> Using the small amount of data we have, the new version improves upon the 
> F-score on our test set.
>
> However, in order to augment the small amount of training data that I have, I 
> have generated some "synthetic" data. One might expect this "unclean" data to 
> confuse the model, yet in 1.6.0 it actually improved the F-score. 
> This is no longer the case in 1.8.4: any manipulation of the data appears to 
> confuse the model and causes it to find many false positives.
>
> I'd like to understand a little better what has changed between these two 
> versions, but the release notes aren't very descriptive. Has anybody else 
> experienced any wild changes with the new version?
>
> Many thanks in advance!
> Fraser
