Hi Ameer,

On Sun, Dec 20, 2015 at 6:09 PM, <[email protected]> wrote:

> With this configuration, I am able to crawl 500k URLs every 4 hours or so.

Sounds like reasonable throughput. You should be able to improve on this, however.

> When I monitor the time for each phase, the fetch phase is the bottleneck.
> As I read in this forum, the fetch phase is an IO-intensive operation, hence it
> does not help to have big instances,

Yes, fetching is a bottleneck for sure. Large fetch lists combined with an unsuitable URL partitioning scheme can really throttle your throughput, so I would advise looking into an appropriate partitioning scheme with many small fetch lists (50-100K URLs each, for example). Fetching with the protocol-http plugin is not an IO-intensive process in itself, as essentially all each thread does is open a socket connection. This does change, however, if you use something like Selenium for your fetching and page interaction/rendering.

> so
> 1. To what extent does this apply in my case?

See above.

> 2. If I can use small instances, is there any formula to substitute
> large instances with small instances?

I would encourage the use of a large persistent head node, with every other node being a spot instance. This is explained in another thread from a few weeks back:
http://www.mail-archive.com/user%40nutch.apache.org/msg14054.html

> Thanks in advance
>
> Regards
> Ameer Tawfik
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Choosing-Amazon-Instance-type-large-vs-small-for-large-scale-crawling-tp4246518.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

--
*Lewis*
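
To make the "many small fetch lists" advice above a little more concrete, here is a rough sketch of the idea in Python. It is not Nutch's own partitioner code. The function name `partition_by_host` and the `max_list_size` parameter are made up for illustration, and it assumes that partitioning by host is the scheme you want. It groups URLs by host and packs them into fetch lists capped at a given size, so that one host's URLs stay together where possible.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def partition_by_host(urls, max_list_size):
    """Pack URLs into fetch lists of at most max_list_size, keeping hosts together.

    Hypothetical helper for illustration only; not part of Nutch.
    """
    # Group URLs by host name.
    by_host = defaultdict(list)
    for url in urls:
        by_host[urlsplit(url).hostname or ""].append(url)

    fetch_lists = [[]]
    # Place the largest hosts first.
    for host in sorted(by_host, key=lambda h: len(by_host[h]), reverse=True):
        group = by_host[host]
        # A host with more URLs than max_list_size is split into chunks.
        for start in range(0, len(group), max_list_size):
            chunk = group[start:start + max_list_size]
            # Start a new fetch list when the current one cannot hold the chunk.
            if len(fetch_lists[-1]) + len(chunk) > max_list_size:
                fetch_lists.append([])
            fetch_lists[-1].extend(chunk)

    # Drop the initial empty list if no URLs were given.
    return [fetch_list for fetch_list in fetch_lists if fetch_list]


if __name__ == "__main__":
    sample = [
        "http://a.example/1", "http://a.example/2",
        "http://b.example/1",
        "http://c.example/1", "http://c.example/2",
    ]
    for i, fetch_list in enumerate(partition_by_host(sample, max_list_size=2)):
        print(i, fetch_list)
```

For scale, 500k URLs in 4 hours works out to roughly 35 URLs per second, and splitting 500k URLs into lists of 50-100K gives somewhere between 5 and 10 fetch lists per round.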

