Hello,

I am almost certain that you will have to pay for data sources. There are a few that are very reasonably priced, such as the full Wikipedia dump (roughly 3 billion words) across many languages. I have not found a free one, particularly for names, and I would be very interested in that possibility. That said, I believe this is how linguists figured out how to make money; after an extensive search, we couldn't find a good data source that was also free.
Thanks,
~Ben

On Mon, Sep 25, 2017 at 10:13 AM, Joern Kottmann <jo...@apache.org> wrote:
> Hello,
>
> you can get good results with something in the range of 10k sentences.
> You should not use fake/generated data for training, since that usually
> gives bad results.
>
> For what kind of domain do you train the models? Which languages?
>
> Jörn
>
> On Thu, Sep 21, 2017 at 1:13 PM, Nikolai Krot <tal...@gmail.com> wrote:
> > Hi colleagues,
> >
> > I want to train my own models (possibly on a modified set of features)
> > for word tokenization and sentence detection. My question is how much
> > training data is required for a reliable model.
> >
> > I have been experimenting with training a word tokenizer model on 3 mln
> > lines of fake corpus. The training procedure is space- and memory-consuming,
> > and the resulting model is also large, so I would like to optimize it by
> > giving it less data.
> >
> > Kind regards,
> > Nikolai KROT
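
For reference, programmatic training of a tokenizer model looks roughly like the sketch below. This is only a minimal sketch, assuming OpenNLP 1.8.x; the training file "en-token.train" (one sentence per line, with <SPLIT> markers at token boundaries not indicated by whitespace), the language code "en", and the output path "en-token.bin" are placeholders, not values from this thread. The TrainingParameters (iterations, cutoff) are usually the first knobs to try when the training run or the resulting model gets too large.

    import java.io.BufferedOutputStream;
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import java.nio.charset.StandardCharsets;

    import opennlp.tools.tokenize.TokenSample;
    import opennlp.tools.tokenize.TokenSampleStream;
    import opennlp.tools.tokenize.TokenizerFactory;
    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;
    import opennlp.tools.util.MarkableFileInputStreamFactory;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;
    import opennlp.tools.util.TrainingParameters;

    public class TrainTokenizer {

        public static void main(String[] args) throws Exception {
            // Read the annotated training data line by line.
            ObjectStream<String> lines = new PlainTextByLineStream(
                    new MarkableFileInputStreamFactory(new File("en-token.train")),
                    StandardCharsets.UTF_8);

            // Wrap each line as a TokenSample (sentence text plus token spans).
            ObjectStream<TokenSample> samples = new TokenSampleStream(lines);

            // Train a maxent tokenizer with default parameters; iterations and
            // cutoff can be tuned here to trade accuracy against model size.
            TokenizerModel model = TokenizerME.train(
                    samples,
                    new TokenizerFactory("en", null, false, null),
                    TrainingParameters.defaultParams());
            samples.close();

            // Write the trained model to disk.
            try (OutputStream out = new BufferedOutputStream(
                    new FileOutputStream("en-token.bin"))) {
                model.serialize(out);
            }
        }
    }

Training a SentenceDetectorME model follows the same pattern with SentenceSample/SentenceSampleStream and a SentenceDetectorFactory.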