RE: Best and economical way of setting hadoop cluster for distributed crawling

2019-11-01 Thread Markus Jelsma
Hello Sachin, You might want to check out the fetcher.* settings in your configuration. They control how many threads run in total, how URLs are queued, what the delay between successive fetches is, how many threads are allowed per queue, etc. Keep in mind, if you do not own the server or have no explicit permission, it is not appropriate to crawl it aggressively.
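For reference, here is a minimal sketch of the relevant knobs in conf/nutch-site.xml. The property names come from Nutch's nutch-default.xml; the values shown are the shipped defaults as I understand them, not a recommendation:

  <property>
    <name>fetcher.threads.fetch</name>
    <value>10</value>
    <description>Total number of fetcher threads across all queues.</description>
  </property>
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>1</value>
    <description>Threads allowed to fetch from the same queue (host) at once.</description>
  </property>
  <property>
    <name>fetcher.server.delay</name>
    <value>5.0</value>
    <description>Seconds to wait between successive requests to the same host.</description>
  </property>
  <property>
    <name>fetcher.queue.mode</name>
    <value>byHost</value>
    <description>How URLs are partitioned into queues: byHost, byDomain or byIP.</description>
  </property>

Raising fetcher.threads.per.queue above 1 or lowering fetcher.server.delay makes a crawl faster but less polite, so only do that for hosts you control or have permission for.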

Re: Best and economical way of setting hadoop cluster for distributed crawling

2019-11-01 Thread Sachin Mittal
OK, understood. I am using the Nutch defaults, and they are set optimally, especially for polite crawling. I am indeed crawling just one host right now, and given the defaults the throughput is what it should be. Yes, one need not be aggressive here, just patient. I think nowhere in the near future

Re: Best and economical way of setting hadoop cluster for distributed crawling

2019-11-01 Thread Sebastian Nagel
Hi Sachin,

> What I have observed is that it usually fetches, parses and indexes
> 1800 web pages. This means 10 pages per minute.

How are the 1800 pages distributed over hosts? The default delay between successive fetches to the same host is 5 seconds. If all pages belong to the same host, the fetch rate is capped at about 12 pages per minute (60 s / 5 s).
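To put numbers on it (my arithmetic, not a measurement): at the 5-second default, one host yields at most 60 / 5 = 12 fetches per minute, so 1800 pages take at least 1800 / 12 = 150 minutes; the observed 10 pages per minute is close to that ceiling once parsing and indexing overhead is included. Because the delay is enforced per host, throughput scales with the number of distinct hosts fetched in parallel: 10 hosts could in principle sustain up to 120 pages per minute under the same polite settings. In other words, adding Hadoop nodes will not speed up a single-host crawl; only more hosts, or a shorter delay where permitted, will.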