OK understood. I am using the Nutch defaults and they are set optimally, especially for polite crawling. I am indeed crawling just one host right now, and given the defaults the throughput is what it should be.
Yes, one need not be aggressive here, just patient. I don't expect to have over 10M URLs across thousands of hosts anywhere in the near future, and local crawling is just fine in my case. So I will continue the way it is right now.

Thanks
Sachin

On Fri, Nov 1, 2019 at 7:36 PM Sebastian Nagel <wastl.na...@googlemail.com.invalid> wrote:

> Hi Sachin,
>
> > What I have observed is that it usually fetches, parses and indexes
> > 1800 web pages.
>
> This means 10 pages per minute.
>
> How are the 1800 pages distributed over hosts?
>
> The default delay between successive fetches to the same host is
> 5 seconds. If all pages belong to the same host, the crawler is
> waiting 50 sec. every minute and the fetching is done in the remaining
> 10 sec.
>
> If you have the explicit permission to access the host(s) aggressively,
> you can decrease the delay (fetcher.server.delay) or even fetch in
> parallel from the same host (fetcher.threads.per.queue).
> Otherwise, please keep the delay as is and be patient and polite! You
> also risk getting blocked by the web admin.
>
> > What I have understood here is that in local mode there is only one
> > thread doing the fetch?
>
> No. The number of parallel threads used in bin/crawl is 50.
>   --num-threads <num_threads>  Number of threads for fetching / sitemap
>                                processing [default: 50]
>
> I can only second Markus: local mode is sufficient unless you're crawling
> - significantly more than 10M+ URLs
> - from 1000+ domains
>
> With fewer domains/hosts there's nothing to distribute, because all
> URLs of one domain/host are processed in one fetcher task to ensure
> politeness.
>
> Best,
> Sebastian
>
> On 11/1/19 6:53 AM, Sachin Mittal wrote:
> > Hi,
> > I understood the point.
> > I would also like to run Nutch on my local machine.
> >
> > So far I am running in standalone mode with the default crawl script,
> > where the fetch time limit is 180 minutes.
> > What I have observed is that it usually fetches, parses and indexes
> > 1800 web pages.
> > I am basically fetching the entire page, and the fetch process is the
> > one that takes maximum time.
> >
> > I have an i7 processor with 16GB of RAM.
> >
> > How can I increase the throughput here?
> > What I have understood here is that in local mode there is only one
> > thread doing the fetch?
> >
> > I guess I would need multiple threads running in parallel.
> > Would running Nutch in pseudo-distributed mode be an answer here?
> > It would then run multiple fetchers and I could increase my throughput.
> >
> > Please let me know.
> >
> > Thanks
> > Sachin
> >
> > On Thu, Oct 31, 2019 at 2:40 AM Markus Jelsma <markus.jel...@openindex.io>
> > wrote:
> >
> >> Hello Sachin,
> >>
> >> Nutch can run on Amazon AWS without trouble, and probably on any
> >> Hadoop-based provider. This is the most expensive option you have.
> >>
> >> Cheaper would be to rent some servers and install Hadoop yourself;
> >> getting it up and running by hand on some servers will take the better
> >> part of a day.
> >>
> >> The cheapest and easiest, and in almost all cases the best option, is
> >> not to run Nutch on Hadoop and to stay local. A local Nutch can easily
> >> handle a couple of million URLs. So unless you want to crawl many
> >> different domains and expect 10M+ URLs, stay local.
> >>
> >> When we first started our business almost a decade ago we rented VPSs
> >> first and then physical machines. This ran fine for some years, but
> >> when we had the option to make some good investments, we bought our own
> >> hardware and have been scaling up the cluster ever since. And with the
> >> previous and most recent AMD-based servers, processing power became
> >> increasingly cheaper.
> >>
> >> If you need to scale up for the long term, getting your own hardware is
> >> indeed the best option.
> >>
> >> Regards,
> >> Markus
> >>
> >>
> >> -----Original message-----
> >>> From: Sachin Mittal <sjmit...@gmail.com>
> >>> Sent: Tuesday 22nd October 2019 15:59
> >>> To: user@nutch.apache.org
> >>> Subject: Best and economical way of setting hadoop cluster for
> >>> distributed crawling
> >>>
> >>> Hi,
> >>> I have been running Nutch in local mode, and so far I have gained a
> >>> good understanding of how it all works.
> >>>
> >>> I wanted to start with distributed crawling using some public cloud
> >>> provider.
> >>>
> >>> I just wanted to know if fellow users have any experience in setting
> >>> up Nutch for distributed crawling.
> >>>
> >>> From the Nutch wiki I have some idea of what the hardware requirements
> >>> should be.
> >>>
> >>> I just wanted to know which of the public cloud providers (IaaS or
> >>> PaaS) are good for setting up Hadoop clusters on: basically, ones on
> >>> which it is easy to set up and manage the cluster, and which are easy
> >>> on the budget.
> >>>
> >>> Please let me know if you folks have any insights based on your
> >>> experiences.
> >>>
> >>> Thanks and Regards
> >>> Sachin
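[Editor's note: for readers who, unlike Sachin, do have explicit permission to crawl a host aggressively, the two properties Sebastian mentions are set in conf/nutch-site.xml. Below is a minimal sketch; the property names come from Sebastian's message, but the values (2-second delay, 2 threads per queue) are purely illustrative assumptions, not recommendations -- the defaults are 5.0 and 1.]

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: overrides for nutch-default.xml.
     Only lower these values with the host owner's permission. -->
<configuration>
  <property>
    <name>fetcher.server.delay</name>
    <!-- Seconds to wait between successive fetches to the same host.
         Default is 5.0; value below is an illustrative assumption. -->
    <value>2.0</value>
  </property>
  <property>
    <name>fetcher.threads.per.queue</name>
    <!-- Threads fetching from the same host/queue in parallel.
         Default is 1 (strictly polite); 2 is an illustrative assumption. -->
    <value>2</value>
  </property>
</configuration>
```

With the defaults (5.0 and 1), a single-host crawl is bounded at roughly 12 pages per minute, which matches the ~10 pages/minute Sebastian computes above.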