Thanks for your reply.
I started the PoC with Nutch and MySQL, and plan to move to Nutch 2.1 with
HBase once I have a fair idea of how Nutch works.
For our use case, I need to crawl large documents from around 100 web
sites weekly. In addition, we need to crawl around 20 different hosts
daily, or even hourly, to extract specific information, for example
product details from a retailer's site. In that case we need to recrawl
the pages to get the latest information.
As you mentioned, I can batch-delete the crawled HTML data once I have
extracted the information from it. I expect the crawled data to be
roughly 1 TB (it could be deleted on a scheduled basis).
Would the following sizing be fine for a Nutch installation in production?
- 4-node Hadoop cluster with 2 TB storage per node
- 64 GB RAM per node
- 10 GB heap
Apart from that, I need to size HBase to store the product details
(around 400 GB of data). Can I use the same HBase cluster where Nutch is
running to store the extracted data?
Could you please let me know your suggestions or recommendations?
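As a rough sanity check on the storage numbers above, the arithmetic can be sketched as follows. It assumes the HDFS default replication factor of 3 and a hypothetical 25% of raw disk kept free for temporary job output and compactions; both values are assumptions and should be replaced with the actual cluster settings.

```python
# Rough storage check for the proposed cluster (a sketch, not a sizing tool).
NODES = 4
DISK_PER_NODE_TB = 2.0
CRAWL_DATA_TB = 1.0        # crawled HTML, deleted on a schedule
EXTRACTED_DATA_TB = 0.4    # product details kept in HBase

REPLICATION = 3            # assumption: HDFS default replication factor
HEADROOM = 0.25            # assumption: fraction of raw disk kept free


def usable_capacity_tb(nodes, disk_per_node_tb, replication, headroom):
    """Logical capacity left after replication and reserved headroom."""
    raw = nodes * disk_per_node_tb
    return raw * (1.0 - headroom) / replication


usable = usable_capacity_tb(NODES, DISK_PER_NODE_TB, REPLICATION, HEADROOM)
required = CRAWL_DATA_TB + EXTRACTED_DATA_TB
print(f"usable: {usable:.2f} TB, required: {required:.2f} TB, "
      f"fits: {required <= usable}")
```

Under these assumptions the margin is not large, so the replication factor and the deletion schedule for crawled data matter more than the raw disk size.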
Thanks and Regards
Deepa Devi Jayaveer
Mobile No: 9940662806
Tata Consultancy Services
Mailto: [email protected]
Website: http://www.tcs.com
____________________________________________
Experience certainty. IT Services
Business Solutions
Consulting
____________________________________________
From: Tejas Patil <[email protected]>
To: "[email protected]" <[email protected]>
Date: 02/13/2014 05:58 AM
Subject: Re: sizing guide
If you are looking for a sizing guide for the specific Nutch 2.1 + MySQL
combination, I don't think there is one on the project wiki.
There is no perfect answer for this, as it depends on several factors (this
list could go on):
- Nature of the data you are crawling: small HTML files or large documents.
- Is it a continuous crawl, or only a few levels deep?
- Are you re-crawling URLs?
- How big is the crawl space?
- Is it an intranet crawl? How frequently do the pages change?
Nutch 1.x would be a perfect fit for production-level crawls. If you still
want to use Nutch 2.x, it would be better to switch to some other datastore
(e.g. HBase).
Below are my experiences with two use cases in which Nutch 1.x was used in
production:
(A) Targeted crawl of a single host
In this case I wanted to get the data crawled quickly and didn't care
about updates that would happen to the pages. I started off with a
five-node Hadoop cluster, but later did the math and found that it would
not get the work done in a few days (remember that you need a delay between
successive requests that the server is happy with, or else your crawler
gets banned). I then bumped the cluster to 15 nodes. The pages were HTML
files of roughly 200 KB each. The crawled data needed roughly 200 GB, and
I had about 500 GB of storage.
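The kind of math mentioned above can be sketched roughly as below. The page count is derived from the 200 GB and 200 KB figures; the delay values and the `parallel_queues` parameter are hypothetical, since the actual delay used in that crawl is not stated here.

```python
# Back-of-the-envelope fetch time for a single-host crawl (sketch only).
CRAWL_SIZE_GB = 200
PAGE_SIZE_KB = 200
SECONDS_PER_DAY = 86_400


def estimated_pages(crawl_size_gb, page_size_kb):
    """Approximate page count, using decimal units (1 GB = 1,000,000 KB)."""
    return crawl_size_gb * 1_000_000 // page_size_kb


def fetch_days(pages, delay_seconds, parallel_queues=1):
    """Days needed if each queue issues one request every delay_seconds."""
    return pages * delay_seconds / parallel_queues / SECONDS_PER_DAY


pages = estimated_pages(CRAWL_SIZE_GB, PAGE_SIZE_KB)
for delay in (1.0, 5.0):  # assumption: example politeness delays in seconds
    print(f"delay {delay}s: {fetch_days(pages, delay):.1f} days "
          f"for {pages} pages")
```

The estimate ignores download and parse time, so it is a lower bound on the fetch phase only.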
(B) Open crawl of several hosts
The configs and memory settings were driven by the prod hardware. I had a
4-node Hadoop cluster with 64 GB RAM per node. A 4 GB heap was configured
for every Hadoop job, with the exception of the generate job, which needed
more heap (8-10 GB). There was no need to store the crawled data, and every
batch was deleted as soon as it was processed. That said, the disk had a
capacity of 2 TB.
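To relate the heap settings to the 64 GB of RAM per node, a simple estimate of how many task JVMs fit on one node could look like this. The 8 GB reserved for the OS and Hadoop daemons is an assumption, and JVM overhead beyond the heap is ignored.

```python
# How many task JVMs fit in one node's RAM (sketch only).
NODE_RAM_GB = 64
DEFAULT_HEAP_GB = 4
GENERATE_HEAP_GB = 10      # upper end of the 8-10 GB range
RESERVED_GB = 8            # assumption: OS, DataNode and other daemons


def max_concurrent_tasks(node_ram_gb, heap_gb, reserved_gb):
    """Whole number of task heaps that fit after the reserved memory."""
    return max((node_ram_gb - reserved_gb) // heap_gb, 0)


print("regular jobs:",
      max_concurrent_tasks(NODE_RAM_GB, DEFAULT_HEAP_GB, RESERVED_GB))
print("generate job:",
      max_concurrent_tasks(NODE_RAM_GB, GENERATE_HEAP_GB, RESERVED_GB))
```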
Thanks,
Tejas
On Wed, Feb 12, 2014 at 1:01 AM, Deepa Jayaveer <[email protected]> wrote:
> Hi,
> I am using Nutch 2.1 with MySQL. Is there a sizing guide available for
> Nutch 2.1?
> Are there any recommendations on sizing memory, CPU and disk space for
> crawling?
>
> Thanks and Regards
> Deepa Devi Jayaveer