If you are looking for a sizing guide for the specific Nutch 2.1 + MySQL
combination, I don't think there is one on the project wiki.

There is no perfect answer for this, as it depends on factors such as these
(the list could go on):
- Nature of the data you are crawling: small HTML files or large documents?
- Is it a continuous crawl, or only a few levels deep?
- Are you re-crawling URLs?
- How big is the crawl space?
- Is it an intranet crawl? How frequently do the pages change?

Nutch 1.x would be a perfect fit for prod-level crawls. If you still want
to use Nutch 2.x, it would be better to switch to some other datastore
(e.g. HBase).

Below are my experiences with two use cases in which Nutch 1.x was used in
prod:

(A) Targeted crawl of a single host
In this case I wanted to get the data crawled quickly and didn't care
about later updates to the pages. I started off with a five-node Hadoop
cluster, but then did the math and saw that it wouldn't get the work done
in a few days (remember that you need a delay between successive requests
that the server is comfortable with, or else your crawler gets banned). So
I bumped the cluster up to 15 nodes. The pages were HTML files of roughly
200 KB each. The crawled data needed roughly 200 GB, and I had about
500 GB of storage.
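
To illustrate the kind of math I mean, here is a rough back-of-the-envelope
sketch in Python. It is not taken from Nutch; the function names are made
up for illustration, and the 5 second delay is only an assumed politeness
value, so plug in whatever the target server tolerates. The page count is
also approximate, since the stored crawl data is not just raw HTML.

```python
# Back-of-the-envelope estimate for a polite crawl of a single host.
# All names here are hypothetical helpers, not part of Nutch.

SECONDS_PER_DAY = 86400


def pages_from_storage(total_bytes, avg_page_bytes):
    """Approximate page count from total crawl size and average page size."""
    return total_bytes // avg_page_bytes


def single_host_crawl_days(num_pages, delay_seconds, parallel_connections=1):
    """Days needed when every request to the host waits `delay_seconds`.

    `parallel_connections` is the number of simultaneous connections
    the host is assumed to tolerate (1 is the conservative choice).
    """
    total_seconds = num_pages * delay_seconds / parallel_connections
    return total_seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    # Figures from case (A): ~200 GB of data, ~200 KB per page.
    pages = pages_from_storage(200 * 10**9, 200 * 10**3)
    # Assumed delay of 5 seconds between successive requests.
    days = single_host_crawl_days(pages, delay_seconds=5.0)
    print(f"{pages} pages, about {days:.1f} days at a 5 s delay")
```

With these assumed inputs the estimate is about a million pages and close
to two months of fetching over a single connection, which is why the delay
that the server accepts matters so much for this kind of crawl.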

(B) Open crawl of several hosts
The configs and memory settings were driven by the prod hardware. I had a
4-node Hadoop cluster with 64 GB RAM per node. A 4 GB heap was configured
for every Hadoop job, with the exception of the generate job, which needed
more heap (8-10 GB). There was no need to store the crawled data, and every
batch was deleted as soon as it was processed. That said, the disk had a
capacity of 2 TB.
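
As a rough illustration of how heap size relates to node memory, the
sketch below estimates how many task JVMs of a given heap fit on one node.
The function name and the 8 GB reserved for the OS and Hadoop daemons are
assumptions for the example, not values from my setup, and a JVM uses more
memory than its heap alone, so treat the result as an upper bound.

```python
# Rough upper bound on concurrent task JVMs per node.
# `max_concurrent_tasks` is a hypothetical helper, not a Hadoop or Nutch API.


def max_concurrent_tasks(node_ram_gb, heap_per_task_gb, reserved_gb):
    """Number of task heaps that fit in RAM after reserving `reserved_gb`."""
    usable_gb = node_ram_gb - reserved_gb
    if usable_gb <= 0 or heap_per_task_gb <= 0:
        return 0
    return int(usable_gb // heap_per_task_gb)


if __name__ == "__main__":
    # 64 GB node and 4 GB heap are from case (B); 8 GB reserved is assumed.
    print(max_concurrent_tasks(64, 4, reserved_gb=8))
    # The generate job needed 8-10 GB, so far fewer of those fit per node.
    print(max_concurrent_tasks(64, 10, reserved_gb=8))
```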

Thanks,
Tejas

On Wed, Feb 12, 2014 at 1:01 AM, Deepa Jayaveer <[email protected]> wrote:

> Hi,
> I am using Nutch 2.1 with MySQL. Is there a sizing guide available for
> Nutch 2.1?
> Are there any recommendations that could be given on sizing memory, CPU
> and disk space for crawling?
>
> Thanks and Regards
> Deepa Devi Jayaveer
> Mobile No: 9940662806
> Tata Consultancy Services
> Mailto: [email protected]
> Website: http://www.tcs.com
