Hi,

Nutch can easily scale to many billions of records; it just depends on how many 
nodes you have and how powerful they are. Raw crawl speed is rarely the issue, 
as fetching is always fast; the bottleneck is usually updating the databases. 
If you spread your data over more machines you will increase your throughput! 
We can easily manage 2 million records on a very small 1-core, 1 GB VPS, but we 
can also manage dozens of billions of records on a small cluster of five 
16-core, 16 GB nodes. It depends on your cluster!
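
If you want to throttle the crawl while testing on a small setup, the usual 
knobs are in nutch-site.xml. A minimal sketch follows; the property names are 
standard Nutch fetcher/generator settings, but the values are only illustrative 
and need tuning for your own hardware and politeness requirements:

  <property>
    <name>fetcher.threads.fetch</name>
    <value>10</value>
    <description>Total fetcher threads; keep this low on a small node.</description>
  </property>
  <property>
    <name>fetcher.server.delay</name>
    <value>5.0</value>
    <description>Seconds to wait between requests to the same server.</description>
  </property>
  <property>
    <name>generate.max.count</name>
    <value>1000</value>
    <description>Cap on URLs per host (or domain) in one generated segment.</description>
  </property>

Combined with a small -topN on the generate step, this lets you grow the crawl 
gradually instead of hitting your cluster with everything at once.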

Cheers,
Markus

 
 
-----Original message-----
> From:h b <[email protected]>
> Sent: Tuesday 2nd July 2013 7:35
> To: [email protected]
> Subject: Nutch scalability tests
> 
> Hi
> Does anyone have some stats on scalability, such as how many URLs you crawled
> and how long it took? These stats obviously depend on the environment and the
> site(s) crawled, but it would still be nice to see some numbers here.
> 
> I used Nutch with HBase and Solr and have got a nice working environment, and
> so far have been able to crawl a limited, indeed very limited, set of URLs
> satisfactorily. Now that I have a proof of concept, I want to run it full
> blown, but before I do that, I want to see whether my setup can even handle
> this. If not, I want to see how I can throttle my runs. So some stats/test
> results would be nice to have.
> 
> 
> Regards
> Hemant
> 
