Hello,

I am almost certain that you will have to pay for data sources. There are a few that are very reasonably priced, such as the full Wikipedia dump (roughly 3 billion words) across many languages. I have not found a free one, particularly for names, and I would be very interested in that possibility. That said, I believe this is how linguists figured out how to make money; after an extensive search, we couldn't find a good data source that was also free.
Thanks,
~Ben

On Mon, Sep 25, 2017 at 10:13 AM, Joern Kottmann <jo...@apache.org> wrote:
> Hello,
>
> you can get good results with something in the range of 10k sentences.
> You should not use fake/generated data for training, since that usually
> gives bad results.
>
> For what kind of domain do you train the models? Which languages?
>
> Jörn
>
> On Thu, Sep 21, 2017 at 1:13 PM, Nikolai Krot <tal...@gmail.com> wrote:
> > Hi colleagues,
> >
> > I want to train my own models (possibly on a modified set of features)
> > for word tokenization and sentence detection. My question is how much
> > training data is required for a reliable model.
> >
> > I have been experimenting with training a word tokenizer model on 3 mln
> > lines of fake corpus. The training procedure is space- and memory-consuming,
> > and the resulting model is also large, so I would like to optimize it by
> > giving it less data.
> >
> > Kind regards,
> > Nikolai KROT
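
For reference, programmatic training of a tokenizer model looks roughly like the sketch below. This is only a minimal sketch, assuming OpenNLP 1.8.x; the training file "en-token.train" (one sentence per line, with <SPLIT> markers at token boundaries not indicated by whitespace), the language code "en", and the output path "en-token.bin" are placeholders, not values from this thread. The TrainingParameters (iterations, cutoff) are usually the first knobs to try when the training run or the resulting model gets too large.

    import java.io.BufferedOutputStream;
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import java.nio.charset.StandardCharsets;

    import opennlp.tools.tokenize.TokenSample;
    import opennlp.tools.tokenize.TokenSampleStream;
    import opennlp.tools.tokenize.TokenizerFactory;
    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;
    import opennlp.tools.util.MarkableFileInputStreamFactory;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;
    import opennlp.tools.util.TrainingParameters;

    public class TrainTokenizer {

        public static void main(String[] args) throws Exception {
            // Read the annotated training data line by line.
            ObjectStream<String> lines = new PlainTextByLineStream(
                    new MarkableFileInputStreamFactory(new File("en-token.train")),
                    StandardCharsets.UTF_8);

            // Wrap each line as a TokenSample (sentence text plus token spans).
            ObjectStream<TokenSample> samples = new TokenSampleStream(lines);

            // Train a maxent tokenizer with default parameters; iterations and
            // cutoff can be tuned here to trade accuracy against model size.
            TokenizerModel model = TokenizerME.train(
                    samples,
                    new TokenizerFactory("en", null, false, null),
                    TrainingParameters.defaultParams());
            samples.close();

            // Write the trained model to disk.
            try (OutputStream out = new BufferedOutputStream(
                    new FileOutputStream("en-token.bin"))) {
                model.serialize(out);
            }
        }
    }

Training a SentenceDetectorME model follows the same pattern with SentenceSample/SentenceSampleStream and a SentenceDetectorFactory.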