Hi Ben,

Science, often driven by grants, should offer its fruit free of charge. That is just my personal opinion.
Wikipedia is a great resource. Both a tokenization corpus and an NER corpus
can, to some extent, be induced from parallel texts plus a bilingual
dictionary for common words. It is just time-consuming.

Best,
Nikolai

On Mon, Sep 25, 2017 at 6:36 PM, Benedict Holland <
benedict.m.holl...@gmail.com> wrote:

> Hello,
>
> I am almost certain that you will have to pay for data sources. There are
> a few that are very reasonable, such as the entire Wikipedia set (roughly
> 3 billion words) across many languages. I have not found a free one,
> particularly for names, and I would be very interested in that
> possibility. That said, I believe this is how linguists figured out how to
> make money, and after an extensive search, we couldn't find a good data
> source that was also free.
>
> Thanks,
> ~Ben
>
> On Mon, Sep 25, 2017 at 10:13 AM, Joern Kottmann <jo...@apache.org> wrote:
>
> > Hello,
> >
> > You can get good results with something in the range of 10k sentences.
> > You should not use fake/generated data for training, since that usually
> > gives bad results.
> >
> > For what kind of domain do you train the models? Which languages?
> >
> > Jörn
> >
> > On Thu, Sep 21, 2017 at 1:13 PM, Nikolai Krot <tal...@gmail.com> wrote:
> >
> > > Hi colleagues,
> > >
> > > I want to train my own models (possibly with a modified set of
> > > features) for word tokenization and sentence detection. My question is
> > > how much training data is required for a reliable model.
> > >
> > > I have been experimenting with training a word tokenizer model on 3
> > > million lines of a fake corpus. The training procedure is space- and
> > > memory-consuming, and the resulting model is also large, so I would
> > > like to optimize it by giving it less data.
> > >
> > > Kind regards,
> > > Nikolai KROT
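
For reference, a minimal sketch of what training a custom tokenizer model
looks like with the OpenNLP 1.8 Java API. It assumes a hypothetical training
file en-token.train in OpenNLP's tokenizer format (one sentence per line,
non-whitespace token boundaries marked with <SPLIT>); the file name and the
iteration/cutoff values are placeholders, not recommendations. A higher
cutoff drops rarely seen features, which is the usual way to keep both
training memory and the serialized model size down.

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import opennlp.tools.tokenize.TokenSample;
import opennlp.tools.tokenize.TokenSampleStream;
import opennlp.tools.tokenize.TokenizerFactory;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainTokenizer {

    public static void main(String[] args) throws Exception {
        // Annotated training data: one sentence per line, with <SPLIT>
        // marking token boundaries that are not whitespace.
        InputStreamFactory in =
                new MarkableFileInputStreamFactory(new File("en-token.train"));
        ObjectStream<String> lines =
                new PlainTextByLineStream(in, StandardCharsets.UTF_8);
        ObjectStream<TokenSample> samples = new TokenSampleStream(lines);

        // Illustrative values; the cutoff prunes features seen fewer than
        // N times, trading a little accuracy for a smaller model.
        TrainingParameters params = TrainingParameters.defaultParams();
        params.put(TrainingParameters.ITERATIONS_PARAM, "100");
        params.put(TrainingParameters.CUTOFF_PARAM, "5");

        // "en" language code, no abbreviation dictionary, alphanumeric
        // optimization enabled, default alphanumeric pattern.
        TokenizerFactory factory = new TokenizerFactory("en", null, true, null);
        TokenizerModel model = TokenizerME.train(samples, factory, params);
        samples.close();

        try (OutputStream out =
                new BufferedOutputStream(new FileOutputStream("en-token.bin"))) {
            model.serialize(out);
        }
    }
}

The command-line equivalent is roughly:

bin/opennlp TokenizerTrainer -model en-token.bin -lang en -data en-token.train -encoding UTF-8

Training a sentence detector follows the same pattern, using
SentenceSampleStream, SentenceDetectorFactory and SentenceDetectorME.train
instead of the tokenizer classes.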