Increase the number of mappers and reducers per node, see mapred-site.xml.
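For example (a sketch with illustrative values, assuming the classic MRv1
property names; match the counts to the CPU cores you actually have), the
per-node slot counts live in mapred-site.xml:

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>8</value>   <!-- illustrative: roughly one map slot per CPU core -->
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>4</value>   <!-- illustrative -->
  </property>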
-----Original message-----
> From: Deepa Jayaveer <[email protected]>
> Sent: Thursday 13th February 2014 11:58
> To: [email protected]
> Cc: [email protected]
> Subject: RE: sizing guide
>
> Hi,
> How do we make smaller mapper/reducer units? Is it done by putting fewer
> URLs in seed.txt?
>
>
> Thanks and Regards
> Deepa Devi Jayaveer
>
>
>
>
> From: Markus Jelsma <[email protected]>
> To: [email protected] <[email protected]>
> Date: 02/13/2014 02:54 PM
> Subject: RE: sizing guide
>
>
>
> Hi,
>
> A 10GB heap is a complete waste of memory and resources. A 500MB heap is in
> most cases enough. It is better to have more small mappers/reducers than a few
> large units. Also, 64GB of RAM per datanode/tasktracker is too much (Nutch
> is not a long-running process and does not benefit from a large heap or a
> lot of OS disk cache), unless you also have 64 CPU cores available. A rule
> of thumb of mine is to allocate one CPU core and 500-1000MB RAM per slot.
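>
> For example (illustrative values, using the MRv1 per-task heap property),
> that rule of thumb translates to something like this in mapred-site.xml:
>
>   <property>
>     <name>mapred.child.java.opts</name>
>     <value>-Xmx1000m</value>   <!-- illustrative: 500-1000MB heap per slot -->
>   </property>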
>
> Cheers
>
>
>
> -----Original message-----
> > From: Deepa Jayaveer <[email protected]>
> > Sent: Thursday 13th February 2014 8:09
> > To: [email protected]
> > Cc: [email protected]
> > Subject: Re: sizing guide
> >
> > Thanks for your reply.
> > I started off the PoC with Nutch + MySQL and planned to move to Nutch 2.1
> > with HBase once I get a fair idea of Nutch.
> > For our use case, I need to crawl large documents from around 100 web sites
> > weekly, and our functionality demands crawling on a daily or even hourly
> > basis to extract specific information from around 20 different hosts. Say,
> > we need to extract product details from a retailer's site. In that case,
> > we need to recrawl the pages to get the latest information.
> >
> > As you mentioned, I can do a batch delete of the crawled HTML data once
> > I have extracted the information from it. I expect the crawled data to be
> > roughly 1 TB (which could be deleted on a scheduled basis).
> >
> > Will this sizing be fine for a Nutch installation in production?
> > 4-node Hadoop cluster with 2 TB storage each
> > 64 GB RAM each
> > 10 GB heap
> >
> > Apart from that, I need to do HBase data sizing to store the product
> > details (which would be around 400 GB of data).
> > Can I use the same HBase cluster to store the extracted data where Nutch
> > is running?
> >
> > Can you please let me know your suggestions or recommendations.
> >
> >
> > Thanks and Regards
> > Deepa Devi Jayaveer
> > Mobile No: 9940662806
> > Tata Consultancy Services
> > Mailto: [email protected]
> > Website: http://www.tcs.com
> >
> >
> >
> > From: Tejas Patil <[email protected]>
> > To: "[email protected]" <[email protected]>
> > Date: 02/13/2014 05:58 AM
> > Subject: Re: sizing guide
> >
> >
> >
> > If you are looking for the specific Nutch 2.1 + MySQL combination, I think
> > there won't be one on the project wiki.
> >
> > There is no perfect answer for this, as it depends on factors like these
> > (the list may go on):
> > - Nature of the data you are crawling: small HTML files or large documents.
> > - Is it a continuous crawl or just a few levels?
> > - Are you re-crawling URLs?
> > - How big is the crawl space?
> > - Is it an intranet crawl? How frequently do the pages change?
> >
> > Nutch 1.x would be a perfect fit for production-level crawls. If you still
> > want to use Nutch 2.x, it would be better to switch to some other datastore
> > (e.g. HBase).
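> >
> > (For reference, a sketch not taken from this thread: switching the Nutch 2.x
> > datastore to HBase is done in conf/gora.properties and conf/nutch-site.xml,
> > roughly as below.)
> >
> >   # conf/gora.properties
> >   gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
> >
> >   <!-- conf/nutch-site.xml -->
> >   <property>
> >     <name>storage.data.store.class</name>
> >     <value>org.apache.gora.hbase.store.HBaseStore</value>
> >   </property>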
> >
> > Below are my experiences with two use cases where Nutch 1.x was used in
> > production:
> >
> > (A) Targeted crawl of a single host
> > In this case I wanted to get the data crawled quickly and didn't bother
> > about updates happening to the pages. I started off with a five-node Hadoop
> > cluster but later did the math and saw that it wouldn't get my work done in
> > a few days (remember that you need to keep the delay between successive
> > requests that the server agrees on, else your crawler gets banned). Later I
> > bumped the cluster to 15 nodes. The pages were HTML files of roughly 200 KB
> > each. The crawled data needed roughly 200 GB, and I had storage of about
> > 500 GB.
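> >
> > (Illustrative math, assuming a 5-second crawl delay, which is not a figure
> > from this thread: a single host crawled politely and serially allows at most
> >
> >   86400 s/day / 5 s/request = 17,280 requests/day
> >
> > no matter how many nodes you add; extra nodes only pay off when the crawl
> > spans many hosts.)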
> >
> > (B) Open crawl of several hosts
> > The configs and memory settings were driven by the production hardware. I
> > had a 4-node Hadoop cluster with 64 GB RAM each. A 4 GB heap was configured
> > for every Hadoop job, with the exception of the generate job, which needed
> > more heap (8-10 GB). There was no need to store the crawled data, and every
> > batch was deleted as soon as it was processed. That said, the disk had a
> > capacity of 2 TB.
> >
> > Thanks,
> > Tejas
> >
> > On Wed, Feb 12, 2014 at 1:01 AM, Deepa Jayaveer
> > <[email protected]> wrote:
> >
> > > Hi,
> > > I am using Nutch 2.1 with MySQL. Is there a sizing guide available for
> > > Nutch 2.1?
> > > Are there any recommendations that could be given on sizing memory, CPU,
> > > and disk space for crawling?
> > >
> > > Thanks and Regards
> > > Deepa Devi Jayaveer
> > > Mobile No: 9940662806
> > > Tata Consultancy Services
> > > Mailto: [email protected]
> > > Website: http://www.tcs.com
> > >
> > >
> > >
> >
> >
> >
>
>
>