If you are looking for a sizing guide for the specific Nutch 2.1 + MySQL combination, I don't think there will be one on the project wiki.
There is no perfect answer for this, as it depends on factors such as these (the list could go on):
- Nature of the data you are crawling: small HTML files or large documents.
- Is it a continuous crawl, or only a few levels deep?
- Are you re-crawling URLs?
- How big is the crawl space?
- Is it an intranet crawl? How frequently do the pages change?

Nutch 1.x would be a perfect fit for prod-level crawls. If you still want to use Nutch 2.x, it would be better to switch to some other datastore (e.g. HBase).

Below are my experiences with two use cases where Nutch 1.x was used in prod:

(A) Targeted crawl of a single host
In this case I wanted to get the data crawled quickly and didn't care about updates that would later happen to the pages. I started off with a five-node Hadoop cluster, but later did the math and found that it wouldn't get my work done in a few days (remember that you need a delay between successive requests that the server is happy with, or else your crawler gets banned). So I bumped the cluster up to 15 nodes. The pages were HTML files of roughly 200 KB each. The crawled data needed roughly 200 GB, and I had about 500 GB of storage.

(B) Open crawl of several hosts
The configs and memory settings were driven by the prod hardware. I had a 4-node Hadoop cluster with 64 GB RAM per node. A 4 GB heap was configured for every Hadoop job, with the exception of the generate job, which needed more heap (8-10 GB). There was no need to store the crawled data, and every batch was deleted as soon as it was processed. That said, the disk had a capacity of 2 TB.

Thanks,
Tejas

On Wed, Feb 12, 2014 at 1:01 AM, Deepa Jayaveer <[email protected]> wrote:
> Hi,
> Am using Nutch 2.1 with MySQL. Is there a sizing guide available for Nutch 2.1?
> Are there any recommendations that could be given on sizing memory, CPU and
> disk space for crawling?
>
> Thanks and Regards
> Deepa Devi Jayaveer
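
For anyone wanting to redo the kind of math mentioned in case (A), here is a minimal sketch of a back-of-the-envelope estimate. The page size (~200 KB) and data volume (~200 GB) come from the mail above; the 5 second politeness delay, the decimal units, the single fetch stream, and the function names are all assumptions made for illustration, so plug in your own numbers. The page count is only a rough figure, since the stored crawl data may include more than the raw HTML.

```python
# Back-of-the-envelope sizing for a single-host crawl (case A).
# From the mail: ~200 KB per page, ~200 GB of crawled data.
# ASSUMPTIONS (not from the mail): decimal units, a 5 s politeness delay,
# and one fetch stream against the host. Function names are hypothetical.

PAGE_SIZE_BYTES = 200 * 1000        # ~200 KB per HTML page
TOTAL_DATA_BYTES = 200 * 1000**3    # ~200 GB of crawled data
SECONDS_PER_DAY = 86400


def estimated_pages(total_bytes: int, page_bytes: int) -> int:
    """Rough page count implied by total data volume and average page size."""
    return total_bytes // page_bytes


def crawl_days(pages: int, delay_seconds: float, parallel_streams: int = 1) -> float:
    """Days needed to fetch `pages` with a fixed delay between requests.

    `parallel_streams` is the number of independent request streams the
    target is willing to tolerate (assumed to be 1 by default).
    """
    total_seconds = pages * delay_seconds / parallel_streams
    return total_seconds / SECONDS_PER_DAY


pages = estimated_pages(TOTAL_DATA_BYTES, PAGE_SIZE_BYTES)
days = crawl_days(pages, delay_seconds=5.0)
print(f"{pages} pages, about {days:.1f} days at one request every 5 s")
```

With these assumed inputs the estimate comes to about a million pages and roughly two months for a single request stream, which shows why the request delay, rather than disk or RAM, can end up dominating the plan for a targeted crawl.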

