Hi Ameer,

On Sun, Dec 20, 2015 at 6:09 PM, <[email protected]> wrote:

>
> With this configuration, I am able to crawl 500k URLs every 4 hours or so.
>

That sounds like reasonable throughput. You should be able to improve on
it, however.


> When I monitor the time for each phase, the fetch phase is the
> bottleneck. As I read in this forum, the fetch phase is an IO-intensive
> operation, hence it does not help to have big instances,


Yes, fetching is a bottleneck for sure. Large fetch lists combined with an
unsuitable URL partitioning scheme can really throttle your throughput, so I
would advise looking into an appropriate partitioning scheme with many
small fetch lists (50-100K URLs each, for example). Fetching with the
protocol-http plugin is not an IO-intensive process in itself, as
essentially all each thread does is open a socket connection. This does
change, however, if you use something like Selenium to do your fetching and
page interaction/rendering.
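
To make the "many small fetch lists" idea concrete, here is a rough sketch
in Python. It is not Nutch's Generator code, and the function names
(num_fetch_lists, partition_by_host) are made up for illustration. It
assumes partitioning by host, so that every URL of a given host lands in
the same fetch list, and it uses the 500k figure from your mail together
with the 50-100K list size suggested above.

```python
import math
import zlib
from collections import defaultdict
from urllib.parse import urlsplit


def num_fetch_lists(total_urls, target_size):
    """Number of fetch lists needed so each holds about target_size URLs."""
    return max(1, math.ceil(total_urls / target_size))


def partition_by_host(urls, n_lists):
    """Assign every URL of a given host to the same fetch list.

    Hypothetical helper: it only illustrates host-based partitioning and
    does not reproduce how Nutch builds its segments.
    """
    lists = defaultdict(list)
    for url in urls:
        host = urlsplit(url).hostname or ""
        # crc32 is used only because it is stable across runs, unlike hash().
        bucket = zlib.crc32(host.encode("utf-8")) % n_lists
        lists[bucket].append(url)
    return dict(lists)


if __name__ == "__main__":
    # 500k URLs per cycle, split into lists of 100K or 50K URLs.
    print(num_fetch_lists(500_000, 100_000))
    print(num_fetch_lists(500_000, 50_000))
```

Note that a hash on the host alone does not balance list sizes: one very
large host can still dominate a single list, which is why the partitioning
scheme has to suit the shape of your crawl.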


> so
>     1. To what extent does this apply in my case?
>

See above


>     2. If I can use small instances, is there any formula to substitute
> large instances with small instances?
>

I would encourage the use of a large persistent head node, with every other
node being a spot instance. This is explained in another thread from a few
weeks back:
http://www.mail-archive.com/user%40nutch.apache.org/msg14054.html


>
> Thanks in advance
>
> Regards
> Ameer Tawfik
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Choosing-Amazon-Instance-type-large-vs-small-for-large-scale-crawling-tp4246518.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>
>


-- 
*Lewis*
