Increase the number of mappers and reducers per node, see mapred-site.xml.
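For example (a sketch with illustrative values, assuming the classic MRv1
property names; match the counts to the CPU cores you actually have), the
per-node slot counts live in mapred-site.xml:

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>8</value>   <!-- illustrative: roughly one map slot per CPU core -->
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>4</value>   <!-- illustrative -->
  </property>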
-----Original message-----
> From: Deepa Jayaveer <[email protected]>
> Sent: Thursday 13th February 2014 11:58
> To: [email protected]
> Cc: [email protected]
> Subject: RE: sizing guide
>
> Hi,
> How do we make smaller mapper/reducer units? Is it done by putting fewer
> URLs in seed.txt?
>
>
> Thanks and Regards
> Deepa Devi Jayaveer
>
>
>
>
> From: Markus Jelsma <[email protected]>
> To: [email protected] <[email protected]>
> Date: 02/13/2014 02:54 PM
> Subject: RE: sizing guide
>
>
>
> Hi,
>
> A 10GB heap is a complete waste of memory and resources. A 500MB heap is in
> most cases enough. It is better to have more small mappers/reducers than a few
> large units. Also, 64GB of RAM per datanode/tasktracker is too much (Nutch
> is not a long-running process and does not benefit from a large heap or a
> lot of OS disk cache), unless you also have 64 CPU cores available. A rule
> of thumb of mine is to allocate one CPU core and 500-1000MB RAM per slot.
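>
> For example (illustrative values, using the MRv1 per-task heap property),
> that rule of thumb translates to something like this in mapred-site.xml:
>
>   <property>
>     <name>mapred.child.java.opts</name>
>     <value>-Xmx1000m</value>   <!-- illustrative: 500-1000MB heap per slot -->
>   </property>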
>
> Cheers
>
>
>
> -----Original message-----
> > From: Deepa Jayaveer <[email protected]>
> > Sent: Thursday 13th February 2014 8:09
> > To: [email protected]
> > Cc: [email protected]
> > Subject: Re: sizing guide
> >
> > Thanks for your reply.
> > I started off the PoC with Nutch + MySQL and planned to move to Nutch 2.1
> > with HBase once I get a fair idea of Nutch.
> > For our use case, I need to crawl large documents from around 100 web sites
> > weekly, and our functionality demands crawling on a daily or even hourly
> > basis to extract specific information from around 20 different hosts. Say,
> > we need to extract product details from a retailer's site. In that case,
> > we need to recrawl the pages to get the latest information.
> >
> > As you mentioned, I can do a batch delete of the crawled HTML data once
> > I have extracted the information from it. I expect the crawled data to be
> > roughly 1 TB (which could be deleted on a scheduled basis).
> >
> > Will this sizing be fine for a Nutch installation in production?
> > 4-node Hadoop cluster with 2 TB storage each
> > 64 GB RAM each
> > 10 GB heap
> >
> > Apart from that, I need to do HBase data sizing to store the product
> > details (which would be around 400 GB of data).
> > Can I use the same HBase cluster to store the extracted data where Nutch
> > is running?
> >
> > Can you please let me know your suggestions or recommendations.
> >
> >
> > Thanks and Regards
> > Deepa Devi Jayaveer
> > Mobile No: 9940662806
> > Tata Consultancy Services
> > Mailto: [email protected]
> > Website: http://www.tcs.com
> >
> >
> >
> > From: Tejas Patil <[email protected]>
> > To: "[email protected]" <[email protected]>
> > Date: 02/13/2014 05:58 AM
> > Subject: Re: sizing guide
> >
> >
> >
> > If you are looking for the specific Nutch 2.1 + MySQL combination, I think
> > there won't be one on the project wiki.
> >
> > There is no perfect answer for this, as it depends on factors like these
> > (the list may go on):
> > - Nature of the data you are crawling: small HTML files or large documents.
> > - Is it a continuous crawl or just a few levels?
> > - Are you re-crawling URLs?
> > - How big is the crawl space?
> > - Is it an intranet crawl? How frequently do the pages change?
> >
> > Nutch 1.x would be a perfect fit for production-level crawls. If you still
> > want to use Nutch 2.x, it would be better to switch to some other datastore
> > (e.g. HBase).
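> >
> > (For reference, a sketch not taken from this thread: switching the Nutch 2.x
> > datastore to HBase is done in conf/gora.properties and conf/nutch-site.xml,
> > roughly as below.)
> >
> >   # conf/gora.properties
> >   gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
> >
> >   <!-- conf/nutch-site.xml -->
> >   <property>
> >     <name>storage.data.store.class</name>
> >     <value>org.apache.gora.hbase.store.HBaseStore</value>
> >   </property>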
> >
> > Below are my experiences with two use cases where Nutch 1.x was used in
> > production:
> >
> > (A) Targeted crawl of a single host
> > In this case I wanted to get the data crawled quickly and didn't bother
> > about updates happening to the pages. I started off with a five-node Hadoop
> > cluster but later did the math and saw that it wouldn't get my work done in
> > a few days (remember that you need to keep the delay between successive
> > requests that the server agrees on, else your crawler gets banned). Later I
> > bumped the cluster to 15 nodes. The pages were HTML files of roughly 200 KB
> > each. The crawled data needed roughly 200 GB, and I had storage of about
> > 500 GB.
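> >
> > (Illustrative math, assuming a 5-second crawl delay, which is not a figure
> > from this thread: a single host crawled politely and serially allows at most
> >
> >   86400 s/day / 5 s/request = 17,280 requests/day
> >
> > no matter how many nodes you add; extra nodes only pay off when the crawl
> > spans many hosts.)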
> >
> > (B) Open crawl of several hosts
> > The configs and memory settings were driven by the production hardware. I
> > had a 4-node Hadoop cluster with 64 GB RAM each. A 4 GB heap was configured
> > for every Hadoop job, with the exception of the generate job, which needed
> > more heap (8-10 GB). There was no need to store the crawled data, and every
> > batch was deleted as soon as it was processed. That said, the disk had a
> > capacity of 2 TB.
> >
> > Thanks,
> > Tejas
> >
> > On Wed, Feb 12, 2014 at 1:01 AM, Deepa Jayaveer
> > <[email protected]> wrote:
> >
> > > Hi,
> > > I am using Nutch 2.1 with MySQL. Is there a sizing guide available for
> > > Nutch 2.1?
> > > Are there any recommendations that could be given on sizing memory, CPU,
> > > and disk space for crawling?
> > >
> > > Thanks and Regards
> > > Deepa Devi Jayaveer
> > > Mobile No: 9940662806
> > > Tata Consultancy Services
> > > Mailto: [email protected]
> > > Website: http://www.tcs.com
> > >
> > >
> > >
> >
> >
> >
>
>
>