Hi,
First of all, I am planning to crawl around 600k URLs up to depth 2. In
total, I expect around 600 million pages (URLs) to be crawled.
However, my constraint is that we need to crawl these as soon as possible, so
time is critical. The problem I am facing is how to determine the
right hardware. We plan to use Amazon EMR because we are going
to crawl these websites only once.
Our current configuration is as follows:
1. 1 master node (m3.xlarge, 4 vCPUs, 15 GB memory).
2. 10 slave nodes (m3.xlarge, 4 vCPUs, 15 GB memory).
I am setting the number of fetcher threads to 400.
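For reference, I am setting this through Nutch's `fetcher.threads.fetch` property; a sketch of the relevant fragment in my `conf/nutch-site.xml` (description text is mine):

```xml
<!-- conf/nutch-site.xml: total fetcher threads per fetch task -->
<property>
  <name>fetcher.threads.fetch</name>
  <value>400</value>
  <description>Number of fetcher threads spawned by each fetch task.</description>
</property>
```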
With this configuration, I am able to crawl 500k URLs every 4 hours or so.
When I monitor the time for each phase, the fetch phase is the bottleneck.
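To put numbers on this, here is the simple arithmetic behind my concern (an extrapolation from the figures above, assuming the rate stays constant, which it may not):

```python
# Observed throughput: ~500k URLs in roughly 4 hours.
crawled_urls = 500_000
window_seconds = 4 * 3600

rate = crawled_urls / window_seconds  # URLs fetched per second
print(f"{rate:.1f} URLs/s")           # ~34.7 URLs/s

# Naive extrapolation to the full ~600M-page crawl at the same rate:
total_urls = 600_000_000
hours_needed = total_urls / rate / 3600
print(f"{hours_needed:.0f} hours (~{hours_needed / 24:.0f} days)")  # 4800 hours (~200 days)
```

So at the current rate the full crawl would take months, which is why choosing the right instances matters so much here.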
As I read in this forum, the fetch phase is an I/O-intensive operation, so
big instances do not help much. Hence:
1. To what extent does this apply in my case?
2. If I can use small instances, is there any formula for substituting
large instances with small instances?
Thanks in advance
Regards
Ameer Tawfik
--
View this message in context:
http://lucene.472066.n3.nabble.com/Choosing-Amazon-Instance-type-large-vs-small-for-large-scale-crawling-tp4246518.html
Sent from the Nutch - User mailing list archive at Nabble.com.