On 09/17/2013 09:53 AM, Giorgio Valoti wrote:
<http://www.corpusitaliano.it/en/index.html> The whole corpus is well over 9GB. It's not my plan to analyze the whole thing, of course! Do you think it would be realistic to use the evaluation tool to decide on a reasonable size for the corpus? I'm not an expert, but I guess there's no point in analyzing that much data if you can achieve good enough accuracy with a much smaller sample, right?
The model's performance depends on the quality of your training data. The description says that part of the corpus has manually corrected annotations. I would suggest training only on those parts if possible, because the other parts are probably less accurate.
Depending on how the model performs on your data, you could also annotate some of your own documents and add them to the training data; this usually helps a lot.
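
To answer the size question: one way is to train on increasing slices of the corrected data and evaluate each model on the same held-out portion; once accuracy stops improving, you have found a reasonable size. Below is a rough, untested sketch of that idea, assuming you are training a POS tagger with a recent OpenNLP Java API (swap in the corresponding classes for other components). The file names, word_TAG data format, and slice sizes are placeholders you would need to adapt.

import java.io.File;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import opennlp.tools.postag.*;
import opennlp.tools.util.*;

public class LearningCurve {

    // Load all word_TAG sentences (one sentence per line) into memory
    // so we can train on slices of different sizes.
    static List<POSSample> read(String path) throws Exception {
        ObjectStream<String> lines = new PlainTextByLineStream(
                new MarkableFileInputStreamFactory(new File(path)),
                StandardCharsets.UTF_8);
        ObjectStream<POSSample> samples = new WordTagSampleStream(lines);
        List<POSSample> all = new ArrayList<>();
        POSSample s;
        while ((s = samples.read()) != null) {
            all.add(s);
        }
        samples.close();
        return all;
    }

    public static void main(String[] args) throws Exception {
        List<POSSample> train = read("it-train.pos"); // manually corrected part
        List<POSSample> test = read("it-test.pos");   // held out, never trained on

        // Train on growing slices and watch where accuracy flattens out.
        for (int n : new int[] {1000, 5000, 10000, 50000}) {
            POSModel model = POSTaggerME.train("it",
                    new CollectionObjectStream<>(
                            train.subList(0, Math.min(n, train.size()))),
                    TrainingParameters.defaultParams(), new POSTaggerFactory());

            POSEvaluator evaluator = new POSEvaluator(new POSTaggerME(model));
            evaluator.evaluate(new CollectionObjectStream<>(test));
            System.out.printf("%d sentences -> accuracy %.4f%n",
                    n, evaluator.getWordAccuracy());
        }
    }
}

Keep the test set fixed and separate from the training slices, otherwise the numbers will look better than they really are.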
Jörn
