Hi Ben,

Science, often driven by grants, should offer its fruit free of charge. That is just my personal opinion.
Wikipedia is a great resource. Both a tokenization corpus and an NER corpus
can, to some extent, be induced from parallel texts plus a bilingual
dictionary for common words. It is just time-consuming.

Best,
Nikolai

On Mon, Sep 25, 2017 at 6:36 PM, Benedict Holland <
benedict.m.holl...@gmail.com> wrote:

> Hello,
>
> I am almost certain that you will have to pay for data sources. There are
> a few that are very reasonable, such as the entire Wikipedia set (roughly
> 3 billion words) across many languages. I have not found a free one,
> particularly for names, and I would be very interested in that
> possibility. That said, I believe this is how linguists figured out how to
> make money, and after an extensive search, we couldn't find a good data
> source that was also free.
>
> Thanks,
> ~Ben
>
> On Mon, Sep 25, 2017 at 10:13 AM, Joern Kottmann <jo...@apache.org> wrote:
>
> > Hello,
> >
> > You can get good results with something in the range of 10k sentences.
> > You should not use fake/generated data for training, since that usually
> > gives bad results.
> >
> > For what kind of domain do you train the models? Which languages?
> >
> > Jörn
> >
> > On Thu, Sep 21, 2017 at 1:13 PM, Nikolai Krot <tal...@gmail.com> wrote:
> >
> > > Hi colleagues,
> > >
> > > I want to train my own models (possibly with a modified set of
> > > features) for word tokenization and sentence detection. My question is
> > > how much training data is required for a reliable model.
> > >
> > > I have been experimenting with training a word tokenizer model on 3
> > > million lines of a fake corpus. The training procedure is space- and
> > > memory-consuming, and the resulting model is also large, so I would
> > > like to optimize it by giving it less data.
> > >
> > > Kind regards,
> > > Nikolai KROT
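
For reference, a minimal sketch of what training a custom tokenizer model
looks like with the OpenNLP 1.8 Java API. It assumes a hypothetical training
file en-token.train in OpenNLP's tokenizer format (one sentence per line,
non-whitespace token boundaries marked with <SPLIT>); the file name and the
iteration/cutoff values are placeholders, not recommendations. A higher
cutoff drops rarely seen features, which is the usual way to keep both
training memory and the serialized model size down.

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import opennlp.tools.tokenize.TokenSample;
import opennlp.tools.tokenize.TokenSampleStream;
import opennlp.tools.tokenize.TokenizerFactory;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainTokenizer {

    public static void main(String[] args) throws Exception {
        // Annotated training data: one sentence per line, with <SPLIT>
        // marking token boundaries that are not whitespace.
        InputStreamFactory in =
                new MarkableFileInputStreamFactory(new File("en-token.train"));
        ObjectStream<String> lines =
                new PlainTextByLineStream(in, StandardCharsets.UTF_8);
        ObjectStream<TokenSample> samples = new TokenSampleStream(lines);

        // Illustrative values; the cutoff prunes features seen fewer than
        // N times, trading a little accuracy for a smaller model.
        TrainingParameters params = TrainingParameters.defaultParams();
        params.put(TrainingParameters.ITERATIONS_PARAM, "100");
        params.put(TrainingParameters.CUTOFF_PARAM, "5");

        // "en" language code, no abbreviation dictionary, alphanumeric
        // optimization enabled, default alphanumeric pattern.
        TokenizerFactory factory = new TokenizerFactory("en", null, true, null);
        TokenizerModel model = TokenizerME.train(samples, factory, params);
        samples.close();

        try (OutputStream out =
                new BufferedOutputStream(new FileOutputStream("en-token.bin"))) {
            model.serialize(out);
        }
    }
}

The command-line equivalent is roughly:

bin/opennlp TokenizerTrainer -model en-token.bin -lang en -data en-token.train -encoding UTF-8

Training a sentence detector follows the same pattern, using
SentenceSampleStream, SentenceDetectorFactory and SentenceDetectorME.train
instead of the tokenizer classes.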