Hello,

To answer my own question: it seems to be a good idea to discard semi-duplicate sentences. I now allow just one occurrence of similar sentences within the Wikipedia corpus, and the training set's rapid growth slows down quickly enough.
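For anyone curious, the deduplication is roughly along these lines (a minimal sketch in Java; the SemiDedup class name, the <ENT>/<SUBJ> placeholders and the first-token masking heuristic are my own illustration, not anything from OpenNLP itself):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Stream;

// Minimal sketch: keep only the first sentence per "template", where a
// template is the sentence with its variable parts masked out.
public class SemiDedup {
    public static void main(String[] args) throws IOException {
        Set<String> seenTemplates = new HashSet<>();
        try (Stream<String> lines = Files.lines(Paths.get(args[0]))) {
            lines.forEach(line -> {
                String template = line
                    // Mask every tagged span, so sentences that differ only
                    // in the entity surface form collapse to one template.
                    .replaceAll("<START:[^>]+>.*?<END>", "<ENT>")
                    // Heuristic: also mask the leading token, since these
                    // bot-generated stubs start with the page title.
                    .replaceFirst("^\\S+", "<SUBJ>");
                if (seenTemplates.add(template)) {
                    System.out.println(line); // first occurrence wins
                }
            });
        }
    }
}

Masking the leading token works here because these stub sentences always start with the page title; less regular data would need a fuzzier similarity measure.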
I have tested the discarded sentences and they are still recognized nicely. So at least in terms of training set size (reduced by 40%) and model size, sentence deduplication makes sense.

Regards,
Markus

-----Original message-----
> From: Markus Jelsma <markus.jel...@openindex.io>
> Sent: Tuesday 11th February 2020 19:08
> To: users@opennlp.apache.org
> Subject: TokenNameFinder, many semi duplicate entries
>
> Hello,
>
> I am generating training sets for TokenNameFinderTrainer based on sentences
> from Wikipedia. In some cases there are hundreds or thousands of stub pages
> generated by bots about some 'thing' in some 'location'. The extracted
> sentences are always very similar.
>
> For example, these Dutch sentences about some Danish church in some
> municipality, and islands located in the Maldives:
> Hornbæk is een parochie van de Deense Volkskerk in de Deense gemeente
> <START:loc> Randers <END> .
> Hørsted is een parochie van de Deense Volkskerk in de Deense gemeente
> <START:loc> Thisted <END> .
> Hørsholm is een parochie van de Deense Volkskerk in de Deense gemeente
> <START:loc> Hørsholm <END>
> Hedehusene is een parochie van de Deense Volkskerk in de Deense gemeente
> <START:loc> Høje-Taastrup <END> .
> Enboodhoofinolhu is een van de onbewoonde eilanden van het Kaafu-atol
> behorende tot de <START:loc> Maldiven <END> .
> Feydhoofinolhu is een van de onbewoonde eilanden van het Kaafu-atol behorende
> tot de <START:loc> Maldiven <END> .
> Furan-nafushi is een van de onbewoonde eilanden van het Kaafu-atol behorende
> tot de <START:loc> Maldiven <END> .
> Fihalhohi is een van de onbewoonde eilanden van het Kaafu-atol behorende tot
> de <START:loc> Maldiven <END> .
> Het ligt ongeveer 35 km van de hoofdstad <START:geo> Malé <END> .
>
> Since the data generated from Wikipedia sources is massive, 1.5M sentences
> (240MB), and I expect it to grow to about 6M sentences, I am looking for ways
> to keep the dataset in good shape while filtering out as many (semi)
> duplicate sentences as I can.
>
> Is it a good idea to get rid of (semi) duplicate sentences?
> Is it recommended to do this because, for example, the model would otherwise
> become too focused on these examples?
>
> What do you think?
>
> Regards,
> Markus