Hi, a 10GB heap is a complete waste of memory and resources; a 500MB heap is in most cases enough. It is better to have many small mappers/reducers than a few large ones. Also, 64GB of RAM per datanode/tasktracker is too much (Nutch is not a long-running process and does not benefit from a large heap or a lot of OS disk cache), unless you also have 64 CPU cores available. A rule of thumb of mine is to allocate one CPU core and 500-1000MB of RAM per slot.
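That rule of thumb can be sketched in `mapred-site.xml`. This is a hedged example, assuming Hadoop 1.x property names and an 8-core tasktracker; the slot split between maps and reduces is an illustrative choice, not a recommendation from this thread:

```xml
<!-- Sketch for mapred-site.xml (Hadoop 1.x), assuming 8 cores per
     tasktracker: roughly one slot per core, ~1GB heap per task slot. -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>6</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1000m</value>
</property>
```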
Cheers

-----Original message-----
> From: Deepa Jayaveer <[email protected]>
> Sent: Thursday 13th February 2014 8:09
> To: [email protected]
> Cc: [email protected]
> Subject: Re: sizing guide
>
> Thanks for your reply.
> I started off the PoC with Nutch + MySQL and planned to move to
> Nutch 2.1 with HBase once I get a fair idea about Nutch.
> For our use case, I need to crawl large documents from around 100 web
> sites weekly, and our functionality demands crawling on a daily or even
> hourly basis to extract specific information from around 20 different
> hosts. Say we need to extract product details from a retailer's site;
> in that case, we need to recrawl the pages to get the latest information.
>
> As you mentioned, I can batch-delete the crawled HTML data once I have
> extracted the information from it. I expect the crawled data to be
> roughly 1 TB (deletable on a scheduled basis).
>
> Will this sizing be fine for a Nutch installation in production?
> 4-node Hadoop cluster with 2 TB storage each
> 64 GB RAM each
> 10 GB heap
>
> Apart from that, I need to do HBase data sizing to store the product
> details (which would be around 400 GB of data).
> Can I use the same HBase cluster to store the extracted data where
> Nutch is running?
>
> Can you please let me know your suggestions or recommendations.
>
> Thanks and Regards
> Deepa Devi Jayaveer
> Mobile No: 9940662806
> Tata Consultancy Services
> Mailto: [email protected]
> Website: http://www.tcs.com
> ____________________________________________
> Experience certainty. IT Services
> Business Solutions
> Consulting
> ____________________________________________
>
> From: Tejas Patil <[email protected]>
> To: "[email protected]" <[email protected]>
> Date: 02/13/2014 05:58 AM
> Subject: Re: sizing guide
>
> If you are looking for the specific Nutch 2.1 + MySQL combination, I
> think there won't be any on the project wiki.
>
> There is no perfect answer for this as it depends on these factors (this
> list may go on):
> - Nature of the data you are crawling: small HTML files or large
>   documents.
> - Is it a continuous crawl or a few levels?
> - Are you re-crawling URLs?
> - How big is the crawl space?
> - Is it an intranet crawl? How frequently are the pages changed?
>
> Nutch 1.x would be a perfect fit for prod-level crawls. If you still
> want to use Nutch 2.x, it would be better to switch to some other
> datastore (e.g. HBase).
>
> Below are my experiences with two use cases wherein Nutch 1.x was used
> in prod:
>
> (A) Targeted crawl of a single host
> In this case I wanted to get the data crawled quickly and didn't bother
> about the updates that would happen to the pages. I started off with a
> five-node Hadoop cluster but later did the math that it wouldn't get my
> work done in a few days (remember that you need to have a delay between
> successive requests which the server agrees on, else your crawler is
> banned). Later I bumped the cluster to 15 nodes. The pages were HTML
> files roughly 200 KB in size. The crawled data needed roughly 200 GB
> and I had storage of about 500 GB.
>
> (B) Open crawl of several hosts
> The configs and memory settings were driven by the prod hardware. I had
> a 4-node Hadoop cluster with 64 GB RAM each. A 4 GB heap was configured
> for every Hadoop job, with the exception of the generate job, which
> needed more heap (8-10 GB). There was no need to store the crawled
> data, and every batch was deleted as soon as it was processed. That
> said, the disk had a capacity of 2 TB.
>
> Thanks,
> Tejas
>
> On Wed, Feb 12, 2014 at 1:01 AM, Deepa Jayaveer
> <[email protected]> wrote:
>
> > Hi,
> > I am using Nutch 2.1 with MySQL. Is there a sizing guide available
> > for Nutch 2.1?
> > Are there any recommendations that could be given on sizing memory,
> > CPU and disk space for crawling?
> >
> > Thanks and Regards
> > Deepa Devi Jayaveer
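As a footnote to Tejas's single-host example above: the politeness delay he mentions is what bounds a single fetch queue, and the back-of-envelope arithmetic is worth doing before sizing a cluster. A minimal sketch — the page count and 5-second delay are assumptions for illustration, chosen to match the ~200 KB pages and ~200 GB of crawled data in his account:

```python
def crawl_days(num_pages: int, delay_seconds: float) -> float:
    """Days needed to fetch num_pages from a single host with one
    fetch queue, honoring a fixed delay between successive requests."""
    return num_pages * delay_seconds / 86400.0

# ~200 GB of ~200 KB pages is on the order of a million pages; at an
# assumed 5-second politeness delay that is roughly two months of
# fetching from a single queue:
print(round(crawl_days(1_000_000, 5.0)))  # -> 58
```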

