On Wed, Feb 12, 2014 at 11:08 PM, Deepa Jayaveer <[email protected]> wrote:

> Thanks for your reply.
> I started off with a PoC of Nutch + MySQL, planning to move to Nutch 2.1
> with HBase once I get a fair idea about Nutch.
> For our use case, I need to crawl large documents from around 100 web
> sites weekly, and our functionality demands crawling on a daily or even
> hourly basis to extract specific information from around 20 different
> hosts. Say we need to extract product details from a retailer's site; in
> that case, we need to recrawl the pages to get the latest information.
>
> As you mentioned, I can do a batch delete of the crawled HTML data once
> I extract the information from it. I expect the crawled data to be
> roughly around 1 TB (which could be deleted on a scheduled basis).
>

If you process the data as soon as it is available, then you might not need
1 TB, unless Nutch gets that much data in a single fetch cycle.
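For the daily/hourly recrawl requirement in your use case, the re-fetch
interval is a config knob in Nutch. A minimal sketch of the relevant
nutch-site.xml properties (the values here are illustrative, not
recommendations; tune them to how often your product pages actually change):

  <!-- nutch-site.xml: illustrative recrawl settings -->
  <property>
    <name>db.fetch.interval.default</name>
    <!-- default re-fetch interval in seconds; 86400 = re-crawl daily -->
    <value>86400</value>
  </property>
  <property>
    <name>db.fetch.schedule.class</name>
    <!-- adaptive schedule shrinks/grows the per-page interval depending
         on whether the page actually changed between fetches -->
    <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  </property>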

>
> Will this sizing be fine for a Nutch installation in production?
> 4 Node Hadoop cluster with 2 TB storage each
> 64 GB RAM each
> 10 GB heap
>

Looks fine. You need to monitor the crawl for the first week or two so as
to know whether you need to change this setup.
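One note on the 10 GB heap: on a Hadoop cluster it is usually applied per
task JVM rather than as one global value. A sketch, assuming Hadoop 1.x-era
property names (newer Hadoop versions rename these settings):

  <!-- mapred-site.xml (or a per-job override): illustrative heap sizing -->
  <property>
    <name>mapred.child.java.opts</name>
    <!-- heap for each map/reduce task JVM; consider raising it only for
         the heavy jobs (e.g. generate) instead of cluster-wide -->
    <value>-Xmx10g</value>
  </property>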

>
> Apart from that, I need to do HBase data sizing to store the product
> details (which would be around 400 GB of data).
> Can I use the same HBase cluster to store the extracted data where Nutch
> is running?
>

Yes, you can. HBase is a black box to me, and it has a bunch of its own
configs which you could tune.
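For reference, Nutch 2.x selects its backend through Apache Gora, so moving
from MySQL to HBase is largely a configuration change. A minimal sketch of
the nutch-site.xml side (the companion gora.properties file would point
gora.datastore.default at the same class):

  <!-- nutch-site.xml: use HBase as the Gora datastore for Nutch 2.x -->
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
  </property>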

>
> Can you please let me know your suggestion or recommendations.
>
>
> Thanks and Regards
> Deepa Devi Jayaveer
> Mobile No: 9940662806
> Tata Consultancy Services
> Mailto: [email protected]
> Website: http://www.tcs.com
>
>
>
> From: Tejas Patil <[email protected]>
> To: "[email protected]" <[email protected]>
> Date: 02/13/2014 05:58 AM
> Subject: Re: sizing guide
>
>
>
> If you are looking for the specific Nutch 2.1 + MySQL combination, I
> think there won't be any on the project wiki.
>
> There is no perfect answer for this, as it depends on factors like these
> (the list may go on):
> - Nature of the data you are crawling: small HTML files or large
> documents.
> - Is it a continuous crawl or just a few levels?
> - Are you re-crawling URLs?
> - How big is the crawl space?
> - Is it an intranet crawl? How frequently are the pages changed?
>
> Nutch 1.x would be a perfect fit for prod-level crawls. If you still want
> to use Nutch 2.x, it would be better to switch to some other datastore
> (e.g. HBase).
>
> Below are my experiences with two prod use cases where Nutch 1.x was
> used:
>
> (A) Targeted crawl of a single host
> In this case I wanted to get the data crawled quickly and didn't bother
> about updates that would happen to the pages. I started off with a
> five-node Hadoop cluster, but later did the math and saw it wouldn't get
> my work done in a few days (remember that you need a delay between
> successive requests that the server agrees on, else your crawler gets
> banned). Later I bumped the cluster to 15 nodes. The pages were HTML
> files roughly 200k in size. The crawled data needed roughly 200 GB and I
> had storage of about 500 GB.
>
> (B) Open crawl of several hosts
> The configs and memory settings were driven by the prod hardware. I had
> a 4-node Hadoop cluster with 64 GB RAM each. A 4 GB heap was configured
> for every Hadoop job, with the exception of the generate job, which
> needed more heap (8-10 GB). There was no need to store the crawled data,
> and every batch was deleted as soon as it was processed. That said, the
> disk had a capacity of 2 TB.
>
> Thanks,
> Tejas
>
> On Wed, Feb 12, 2014 at 1:01 AM, Deepa Jayaveer
> <[email protected]> wrote:
>
> > Hi,
> > I am using Nutch 2.1 with MySQL. Is there a sizing guide available for
> > Nutch 2.1?
> > Are there any recommendations that could be given on sizing memory,
> > CPU, and disk space for crawling?
> >
> > Thanks and Regards
> > Deepa Devi Jayaveer
> > Mobile No: 9940662806
> > Tata Consultancy Services
> > Mailto: [email protected]
> > Website: http://www.tcs.com
> >
> >
> >
>
>
>
