Il giorno 17/set/2013, alle ore 10:18, Jörn Kottmann ha scritto:

> On 09/17/2013 09:53 AM, Giorgio Valoti wrote:
>> <http://www.corpusitaliano.it/en/index.html>  The whole corpus is well over 
>> 9GB. It’s not my plan to analyze the whole thing, of course! Do you think 
>> would it realistic to use the evaluation tool to decide a reasonable size 
>> for the corpus? I’m not an expert, but I guess there’s no point in analyzed 
>> that many data if you can achieve a good enough accuracy with a much smaller 
>> sample, right?
> 
> The model performance depends on the quality of your training data, The 
> description says that the corpus is in part manually corrected for 
> annotations. I would suggest to only train on these parts if possible, 
> because the other parts are probably less accurate.

Unfortunately, it seems there’s no way to which part are manually corrected. :( 
I’ve contacted the site; we’ll see.

> 
> Depending on the performance of the model on your data, you could annotate 
> some of your documents and add them to the training data, this usually helps 
> a lot.


--
Giorgio Valoti

Reply via email to