Hi, I understood the point. I would also like to run nutch on my local machine.
So far I am running in standalone mode with the default crawl script, where the fetch time limit is 180 minutes. What I have observed is that it usually fetches, parses and indexes 1800 web pages. I am fetching the entire page, and the fetch phase is the one that takes the most time. I have an i7 processor with 16GB of RAM.

How can I increase the throughput here? My understanding is that in local mode there is only one thread doing the fetch, so I guess I would need multiple threads running in parallel. Would running nutch in pseudo-distributed mode be an answer here? It would then run multiple fetchers and I could increase my throughput.

Please let me know.

Thanks
Sachin

On Thu, Oct 31, 2019 at 2:40 AM Markus Jelsma <markus.jel...@openindex.io> wrote:

> Hello Sachin,
>
> Nutch can run on Amazon AWS without trouble, and probably on any Hadoop-based
> provider. This is the most expensive option you have.
>
> Cheaper would be to rent some servers and install Hadoop yourself; getting
> it up and running by hand on some servers will take the better part of a day.
>
> The cheapest and easiest, and in almost all cases the best option, is not
> to run Nutch on Hadoop and stay local. A local Nutch can easily handle a
> couple of million URLs. So unless you want to crawl many different domains
> and expect 10M+ URLs, stay local.
>
> When we first started our business almost a decade ago we rented VPSs
> first and then physical machines. This ran fine for some years, but when we
> had the option to make some good investments, we bought our own hardware
> and have been scaling up the cluster ever since. And with the previous and
> most recent AMD-based servers, processing power became increasingly cheaper.
>
> If you need to scale up for the long term, getting your own hardware is
> indeed the best option.
>
> Regards,
> Markus
>
>
> -----Original message-----
> > From: Sachin Mittal <sjmit...@gmail.com>
> > Sent: Tuesday 22nd October 2019 15:59
> > To: user@nutch.apache.org
> > Subject: Best and economical way of setting hadoop cluster for distributed crawling
> >
> > Hi,
> > I have been running nutch in local mode and so far I am able to have a
> > good understanding on how it all works.
> >
> > I wanted to start with distributed crawling using some public cloud
> > provider.
> >
> > I just wanted to know if fellow users have any experience in setting up
> > nutch for distributed crawling.
> >
> > From the nutch wiki I have some idea on what the hardware requirements
> > should be.
> >
> > I just wanted to know which of the public cloud providers (IaaS or PaaS)
> > are good to set up hadoop clusters on. Basically, ones on which it is
> > easy to setup/manage the cluster and ones which are easy on budget.
> >
> > Please let me know if you folks have any insights based on your
> > experiences.
> >
> > Thanks and Regards
> > Sachin
> >
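On the fetch-thread question above: in a standard Nutch 1.x setup, local mode does run multiple fetch threads; the count is controlled by the `fetcher.threads.fetch` property (default 10 in nutch-default.xml), which can be overridden in conf/nutch-site.xml. A minimal sketch, assuming Nutch 1.x defaults (the chosen values are illustrative, not recommendations):

```xml
<!-- conf/nutch-site.xml (sketch, assuming Nutch 1.x defaults) -->
<configuration>
  <!-- Total fetch threads used by the Fetcher, also honoured in local mode
       (default is 10 in nutch-default.xml) -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>50</value>
  </property>
  <!-- Threads allowed per host queue; raising this increases the load on
       each individual site, so keep it polite when crawling external hosts -->
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>2</value>
  </property>
</configuration>
```

The 180-minute limit mentioned above corresponds to the `fetcher.timelimit.mins` property, which the default crawl script sets. Note that extra threads only help if the crawl spans enough distinct hosts, since per-host politeness limits still apply; pseudo-distributed mode on a single machine mostly adds Hadoop overhead rather than fetch parallelism.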