OK understood.
I am using the Nutch defaults, and they are set optimally, especially for
polite crawling.
Right now I am indeed crawling just one host, and given the defaults the
throughput is what it should be.
Yes, one need not be aggressive here; it is better to just be patient.
I think nowhere in the near future
Hi Sachin,
> What I have observed is that it usually fetches, parses and indexes
> 1800 web pages.
This means 10 pages per minute.
How are the 1800 pages distributed over hosts?
The default delay between successive fetches to the same host is
5 seconds. If all pages belong to the same host,
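(A back-of-the-envelope check, assuming everything does come from a single
host and the stock 5-second fetcher.server.delay; the numbers are only an
estimate:

  60 s / 5 s delay            = at most 12 fetches per minute from one host
  12 pages/min * 180 minutes  = roughly 2160 pages per fetch cycle

so seeing about 1800 pages fetched in a 180-minute fetch round is close to
what the politeness delay allows.)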
Hello Sachin,
You might want to check out the fetcher.* settings in your configuration. They
control how many threads run in total, how URLs are queued, what the delay
between successive fetches is, how many threads per queue, etc.
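As a sketch (the property names are from conf/nutch-default.xml; the values
shown are, as far as I remember, the stock defaults, so treat them as
illustration rather than recommendation), these can be overridden in
nutch-site.xml:

  <property>
    <name>fetcher.threads.fetch</name>
    <value>10</value>        <!-- total number of fetcher threads -->
  </property>
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>1</value>         <!-- threads allowed on one queue (host) at a time -->
  </property>
  <property>
    <name>fetcher.server.delay</name>
    <value>5.0</value>       <!-- seconds between successive requests to the same host -->
  </property>
  <property>
    <name>fetcher.queue.mode</name>
    <value>byHost</value>    <!-- group URLs into queues byHost, byDomain or byIP -->
  </property>

Raising the threads per queue or lowering the delay makes the crawl noticeably
less polite, which is exactly the point below.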
Keep in mind, if you do not own the server or have no explicit permission, it
Hi,
I understand the point.
I would also like to run Nutch on my local machine.
So far I have been running in standalone mode with the default crawl script,
where the fetch time limit is 180 minutes.
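(For reference, and if I read the stock bin/crawl script correctly, that
180-minute cap is passed to the fetcher as fetcher.timelimit.mins, so it can
also be changed or disabled through nutch-site.xml; the value below is just an
example:

  <property>
    <name>fetcher.timelimit.mins</name>
    <value>180</value>   <!-- stop the fetch phase after this many minutes; -1 means no limit -->
  </property>
)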
What I have observed is that it usually fetches, parses and indexes 1800
web pages.
I am basically fetching the
Hello Sachin,
Nutch can run on Amazon AWS without trouble, and probably on any Hadoop-based
provider. This is the most expensive option you have.
Cheaper would be to rent some servers and install Hadoop yourself; getting it
up and running by hand on some servers will take the better part of a
Hi,
I have been running Nutch in local mode, and so far I have a good
understanding of how it all works.
I wanted to start with distributed crawling using some public cloud
provider.
I just wanted to know if fellow users have any experience in setting up
Nutch for distributed crawling.