Hello,

To answer my own question, it seems to be a good idea to discard semi-duplicate 
sentences. I now allow just one occurrence of similar sentences within the 
Wikipedia corpus, and the training set's fast-paced growth slows down quickly 
enough.
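
In case it is useful to others, here is a rough sketch in Java of such a 
filter. The normalisation heuristic is illustrative, not the exact rule I 
use: it masks the annotated entity span and the leading subject token, 
which is enough for bot-generated stubs like the ones quoted below.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Stream;

// Rough sketch of a semi-duplicate filter; the normalisation heuristic is
// illustrative, not the exact rule I use. Each training sentence is reduced
// to a "template" by masking the annotated entity span and the leading
// subject token; only the first sentence per template is kept.
public class SemiDuplicateFilter {

    // Matches an OpenNLP name-finder annotation, e.g. "<START:loc> Randers <END>"
    private static final Pattern ANNOTATION =
        Pattern.compile("<START:[^>]+>.*?<END>");

    // Reduce a sentence to the key used for deduplication.
    static String templateOf(String sentence) {
        String t = ANNOTATION.matcher(sentence).replaceAll("<ENT>");
        // Also mask the first token: in the bot-generated stubs only the
        // subject name (Hornbæk, Hørsted, ...) varies besides the entity.
        return t.replaceFirst("^\\S+", "<SUBJ>").toLowerCase();
    }

    public static void main(String[] args) throws IOException {
        Set<String> seen = new HashSet<>();
        // Stream the training file line by line, keep only the first
        // occurrence of each template, and write the survivors to stdout.
        try (Stream<String> lines =
                Files.lines(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            lines.filter(line -> seen.add(templateOf(line)))
                 .forEach(System.out::println);
        }
    }
}

A plain HashSet of templates fits in memory even at 6M sentences; a more 
robust filter could use shingling or MinHash, but a simple template key 
already catches these stubs.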

I have tested the discarded sentences and they are still recognized nicely. So 
at least in terms of training set size (reduced by 40%) and model size, 
sentence deduplication makes sense.

Regards,
Markus

-----Original message-----
> From:Markus Jelsma <markus.jel...@openindex.io>
> Sent: Tuesday 11th February 2020 19:08
> To: users@opennlp.apache.org
> Subject: TokenNameFinder, many semi duplicate entries
> 
> Hello,
> 
> I am generating training sets for TokenNameFinderTrainer based on sentences 
> from Wikipedia. In some cases there are hundreds or thousands of stub pages 
> generated by bots about some 'thing' in some 'location'. The extracted 
> sentences are always very similar.
> 
> For example, these Dutch sentences about parishes of the Danish national 
> church in various municipalities, and about uninhabited islands in the 
> Maldives:
> Hornbæk is een parochie van de Deense Volkskerk in de Deense gemeente 
> <START:loc> Randers <END> .
> Hørsted is een parochie van de Deense Volkskerk in de Deense gemeente 
> <START:loc> Thisted <END> .
> Hørsholm is een parochie van de Deense Volkskerk in de Deense gemeente 
> <START:loc> Hørsholm <END> .
> Hedehusene is een parochie van de Deense Volkskerk in de Deense gemeente 
> <START:loc> Høje-Taastrup <END> .
> Enboodhoofinolhu is een van de onbewoonde eilanden van het Kaafu-atol 
> behorende tot de <START:loc> Maldiven <END> .
> Feydhoofinolhu is een van de onbewoonde eilanden van het Kaafu-atol behorende 
> tot de <START:loc> Maldiven <END> .
> Furan-nafushi is een van de onbewoonde eilanden van het Kaafu-atol behorende 
> tot de <START:loc> Maldiven <END> .
> Fihalhohi is een van de onbewoonde eilanden van het Kaafu-atol behorende tot 
> de <START:loc> Maldiven <END> .
> Het ligt ongeveer 35 km van de hoofdstad <START:geo> Malé <END> .
> 
> Since the data generated from Wikipedia sources is massive (1.5M sentences, 
> 240MB, and I expect it to grow to about 6M sentences), I am looking for ways 
> to keep the dataset in its best state while filtering out as many 
> (semi-)duplicate sentences as I can.
> 
> Is it a good idea to get rid of (semi-)duplicate sentences?
> Is it recommended to do this because, for example, the model would otherwise 
> become too focused on these examples?
> 
> What do you think?
> 
> Regards,
> Markus 
