Hi, Nutch can easily scale to many billions of records; it mostly depends on how many nodes you have and how powerful they are. Raw crawl speed is rarely the issue, since fetching is always fast; the bottleneck is usually updating the databases. If you spread your data over more machines, you will increase your throughput. We comfortably manage 2 million records on a very small 1-core, 1 GB VPS, but we also manage dozens of billions of records on a small cluster of five 16-core, 16 GB nodes. It all depends on your cluster!
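For reference, a single crawl cycle on a 2.x/HBase setup (which is what you describe) looks roughly like the sketch below. The -topN value caps the batch size, which is the simplest way to throttle a run while you find out how much your cluster can handle. This is only a minimal sketch; the 50000 batch size and the Solr URL are placeholders, and exact flags vary between Nutch versions:

    # one crawl cycle, batch size capped at 50k URLs (placeholder value)
    bin/nutch inject urls                  # seed the webpage table from the urls/ directory
    bin/nutch generate -topN 50000         # select at most 50k URLs to fetch
    bin/nutch fetch -all                   # fetch the generated batch
    bin/nutch parse -all                   # parse the fetched content
    bin/nutch updatedb                     # update the db - usually the heavy step
    bin/nutch solrindex http://localhost:8983/solr/ -all   # push to Solr (URL is a placeholder)

Each step runs as a MapReduce job, so the updatedb step parallelises across nodes; that is where adding machines pays off.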
Cheers,
Markus

-----Original message-----
> From: h b <[email protected]>
> Sent: Tuesday 2nd July 2013 7:35
> To: [email protected]
> Subject: Nutch scalability tests
>
> Hi,
> Does anyone have some stats on scalability, i.e. how many URLs you crawled
> and how long it took? These stats obviously depend on the environment and
> the site(s) crawled, but it would be nice to see some numbers here.
>
> I used Nutch with HBase and Solr and have a nice working environment, and
> so far I have been able to crawl a limited set, rather a very, very limited
> set, of URLs satisfactorily. Now that I have a proof of concept, I want to
> run it full blown, but before I do that, I want to see if my setup can even
> handle this. If not, I want to see how I can throttle my runs. So some
> stats/test results would be nice to have.
>
> Regards,
> Hemant

