On 17 Sep 2013, at 10:18, Jörn Kottmann wrote:

> On 09/17/2013 09:53 AM, Giorgio Valoti wrote:
>> <http://www.corpusitaliano.it/en/index.html>
>> The whole corpus is well over 9GB. It's not my plan to analyze the whole
>> thing, of course! Do you think it would be realistic to use the evaluation
>> tool to decide on a reasonable size for the corpus? I'm not an expert, but
>> I guess there's no point in analyzing that much data if you can achieve
>> good enough accuracy with a much smaller sample, right?
>
> The model performance depends on the quality of your training data. The
> description says that the corpus is in part manually corrected for
> annotations. I would suggest training only on those parts if possible,
> because the other parts are probably less accurate.
Unfortunately, it seems there's no way to tell which parts are manually
corrected. :( I've contacted the site; we'll see.

> Depending on the performance of the model on your data, you could annotate
> some of your documents and add them to the training data; this usually
> helps a lot.

--
Giorgio Valoti
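
P.S. To make the idea of using the evaluation tool to pick a corpus size a
bit more concrete, here is a rough sketch of the kind of train-and-evaluate
loop I have in mind, written against the OpenNLP 1.5 Java API. It uses a
sentence-detector model purely as a stand-in (the same pattern applies to the
other OpenNLP trainers and evaluators), and "train-10k.txt" / "heldout.txt"
are placeholder names for slices of the corpus:

import java.io.FileInputStream;
import java.io.IOException;

import opennlp.tools.sentdetect.SentenceDetectorEvaluator;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.sentdetect.SentenceSample;
import opennlp.tools.sentdetect.SentenceSampleStream;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

public class CorpusSizeCheck {

    public static void main(String[] args) throws IOException {
        // Train a sentence detector on a slice of the corpus
        // (training format: one sentence per line, empty line = document boundary).
        ObjectStream<SentenceSample> trainSamples = new SentenceSampleStream(
                new PlainTextByLineStream(new FileInputStream("train-10k.txt"), "UTF-8"));
        SentenceModel model = SentenceDetectorME.train("it", trainSamples, true, null);
        trainSamples.close();

        // Evaluate the freshly trained model against a held-out slice.
        ObjectStream<SentenceSample> testSamples = new SentenceSampleStream(
                new PlainTextByLineStream(new FileInputStream("heldout.txt"), "UTF-8"));
        SentenceDetectorEvaluator evaluator =
                new SentenceDetectorEvaluator(new SentenceDetectorME(model));
        evaluator.evaluate(testSamples);
        testSamples.close();

        // Prints precision, recall and F-measure for this training size.
        System.out.println(evaluator.getFMeasure());
    }
}

Running this over a few increasingly large training slices and comparing the
printed F-measures should show roughly where the accuracy curve flattens out.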
