Hi,
How do I make the mapper/reducer units smaller? Is it done by putting fewer 
URLs in seed.txt?


Thanks and Regards
Deepa Devi Jayaveer




From: Markus Jelsma <[email protected]>
To: [email protected] <[email protected]>
Date: 02/13/2014 02:54 PM
Subject: RE: sizing guide



Hi,

10GB heap is a complete waste of memory and resources; a 500MB heap is enough 
in most cases. It is better to have many small mappers/reducers than a few 
large units. Also, 64GB of RAM per datanode/tasktracker is too much (Nutch is 
not a long-running process and does not benefit from a large heap or a lot of 
OS disk cache), unless you also have 64 CPU cores available. A rule of thumb 
of mine is to allocate one CPU core and 500-1000MB of RAM per slot.
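
A minimal mapred-site.xml sketch along those lines (stock Hadoop 1.x property 
names; the heap and slot values below are only illustrative and assume a 
tasktracker with roughly 12 cores):

<configuration>
  <!-- small per-task heap: 500MB-1GB is usually enough for Nutch jobs -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m</value>
  </property>
  <!-- roughly one slot per CPU core, split between map and reduce slots -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>8</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>4</value>
  </property>
</configuration>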

Cheers 

 
 
-----Original message-----
> From: Deepa Jayaveer <[email protected]>
> Sent: Thursday 13th February 2014 8:09
> To: [email protected]
> Cc: [email protected]
> Subject: Re: sizing guide
> 
> Thanks for your reply.
> I started off the PoC with Nutch and MySQL, and planned to move to Nutch 2.1 
> with HBase once I get a fair idea about Nutch.
> For our use case, I need to crawl large documents from around 100 web sites 
> weekly, and our functionality demands crawling on a daily or even hourly 
> basis to extract specific information from around 20 different hosts. Say we 
> need to extract product details from a retailer's site; in that case we need 
> to recrawl the pages to get the latest information.
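> 
> (For the daily recrawl part I assume we would simply lower Nutch's default 
> refetch interval, roughly like the nutch-site.xml sketch below; the property 
> is the standard db.fetch.interval.default and 86400 seconds is just an 
> example value for a one-day interval.)
> 
> <property>
>   <name>db.fetch.interval.default</name>
>   <!-- refetch pages after one day instead of the 30-day default -->
>   <value>86400</value>
> </property>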
> 
> As you mentioned, I can do a batch delete of the crawled HTML data once I 
> have extracted the information from it. I expect the crawled data to be 
> roughly 1 TB (and it could be deleted on a scheduled basis).
> 
> Will this sizing be fine for a Nutch installation in production?
> 4-node Hadoop cluster with 2 TB of storage each
> 64 GB RAM each
> 10 GB heap
> 
> Apart from that, I need to do HBase data sizing to store the product details 
> (which would be around 400 GB of data).
> Can I use the same HBase cluster to store the extracted data where Nutch 
> is running?
> 
> Can you please let me know your suggestions or recommendations?
> 
> 
> Thanks and Regards
> Deepa Devi Jayaveer
> Mobile No: 9940662806
> Tata Consultancy Services
> Mailto: [email protected]
> Website: http://www.tcs.com
> 
> 
> 
> From: Tejas Patil <[email protected]>
> To: "[email protected]" <[email protected]>
> Date: 02/13/2014 05:58 AM
> Subject: Re: sizing guide
> 
> 
> 
> If you are looking for the specific Nutch 2.1 + MySQL combination, I think 
> there won't be anything on the project wiki.
> 
> There is no perfect answer for this, as it depends on factors like these (the
> list may go on):
> - Nature of the data that you are crawling: small HTML files or large documents.
> - Is it a continuous crawl or just a few levels?
> - Are you re-crawling URLs?
> - How big is the crawl space?
> - Is it an intranet crawl? How frequently do the pages change?
> 
> Nutch 1.x would be a perfect fit for prod-level crawls. If you still want
> to use Nutch 2.x, it would be better to switch to some other datastore 
> (e.g. HBase).
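> 
> If you do go the Nutch 2.x + HBase route, the switch is mostly configuration; 
> roughly speaking it comes down to a nutch-site.xml entry like the sketch 
> below (plus the gora-hbase dependency and a matching gora.properties entry):
> 
> <property>
>   <name>storage.data.store.class</name>
>   <!-- tell Gora to persist Nutch 2.x crawl data in HBase instead of MySQL -->
>   <value>org.apache.gora.hbase.store.HBaseStore</value>
> </property>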
> 
> Below are my experiences with two use cases wherein Nutch 1.x was used in 
> prod:
> 
> (A) Targeted crawl of a single host
> In this case I wanted to get the data crawled quickly and didn't bother
> about the updates that would happen to the pages. I started off with a 
> five-node Hadoop cluster but later did the math that it wouldn't get my work 
> done in a few days (remember that you need to keep a delay between successive
> requests which the server agrees on, or else your crawler gets banned; with a
> 5-second delay, for instance, a single host serves at most around 17,000 
> pages a day no matter how many nodes you add). Later I bumped the cluster to 
> 15 nodes. The pages were HTML files of roughly 200k each. The crawled data 
> needed roughly 200GB and I had storage of about 500GB.
> 
> (B) Open crawl of several hosts
> The configs and memory settings were driven by the prod hardware. I had a 
> 4-node Hadoop cluster with 64 GB RAM each. A 4 GB heap was configured for 
> every Hadoop job, with the exception of the generate job, which needed more 
> heap (8-10 GB). There was no need to store the crawled data, and every batch 
> was deleted as soon as it was processed. That said, the disks had a capacity 
> of 2 TB.
> 
> Thanks,
> Tejas
> 
> On Wed, Feb 12, 2014 at 1:01 AM, Deepa Jayaveer 
> <[email protected]> wrote:
> 
> > Hi,
> > I am using Nutch 2.1 with MySQL. Is there a sizing guide available for 
> > Nutch 2.1?
> > Are there any recommendations that could be given on sizing memory, CPU, 
> > and disk space for crawling?
> >
> > Thanks and Regards
> > Deepa Devi Jayaveer
> > Mobile No: 9940662806
> > Tata Consultancy Services
> > Mailto: [email protected]
> > Website: http://www.tcs.com
> >
> >
> 
> 
> 

