Hi,

Does anyone have stats on crawl scalability: how many URLs you crawled and how long it took? These numbers obviously depend on the environment and on the site(s) being crawled, but it would still be useful to see some reference points.
I have set up Nutch with HBase and Solr and have a working environment; so far I have only crawled a very limited set of URLs, but the results are satisfactory. Now that I have a proof of concept, I want to run a full-scale crawl, but before I do, I want to know whether my setup can even handle it. If not, I would like to know how to throttle my runs (see the sketch below for the kind of settings I have in mind). So some stats/test results would be very helpful.
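To make the throttling part of the question concrete, this is the sort of thing I mean: a sketch of overrides in nutch-site.xml. The property names come from the stock nutch-default.xml; the values here are just placeholders for my setup, not recommendations:

  <!-- Throttling sketch: property names from nutch-default.xml, values are placeholders -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>10</value>
    <description>Total number of fetcher threads across the crawl.</description>
  </property>
  <property>
    <name>fetcher.server.delay</name>
    <value>5.0</value>
    <description>Seconds to wait between successive requests to the same server.</description>
  </property>
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>1</value>
    <description>Threads allowed to fetch from a single host at one time.</description>
  </property>
  <property>
    <name>generate.max.count</name>
    <value>1000</value>
    <description>Cap on URLs per host/domain selected into one batch.</description>
  </property>

On top of that, I understand each round can be kept small by capping the batch at generate time, with something like bin/nutch generate -topN 1000. If anyone can share what values worked for them at scale, that would help me calibrate.

Regards
Hemant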

