On 22 June 2014 22:07, Meraj A. Khan <[email protected]> wrote:
>
> Hello Folks,
>
> I have noticed that Nutch resources and mailing lists are mostly geared
> towards the usage of Nutch in research-oriented projects. I would like to
> know from those of you who are using Nutch in production for large-scale
> crawling (vertical or non-vertical) about what challenges to expect and how
> to overcome them.
>
> I will list a few challenges that I faced below and would like to hear
> from you, if you faced these challenges, on how you overcame them.
>
> 1. If I were to go for a vertical search engine for websites in a
> particular domain and follow the crawl-delay directive for politeness in
> the robots.txt, there is a possibility that the web master could still
> block my IP address and I start getting HTTP 403 forbidden/access denied
> messages. How can I overcome these kinds of issues, other than providing
> full contact info in the nutch-site.xml for the web master to get in touch
> with me before blocking me?

Er, providing full contact info is just basic politeness, and IMHO
should become a requirement for Nutch. If you are going to hit some
sites particularly hard, with good reason, try contacting the website
administrators and explaining to them why you need such access. We both
administer and crawl sites, and as an administrator I am quite willing
to accept reasonable requests. After all, it is also our goal to promote
our websites, and already most traffic on the web is through search
engines.

> 2. The fact that you will be considered as just another Nutch variant by
> web master puts you at a great level of disadvantage, where you could be
> blocked from accessing the web site at the whims of the web master.

Not sure what you mean by "just another Nutch variant", nor why you
think that it puts you at a disadvantage. Disadvantage compared to whom?
Also, "whims of the web master"? Really? After all, it is their
resources that you are using, and they are perfectly within their rights
to ban you if they feel, for whatever reason, that you are abusing such
resources.

> 3. Can anyone share info as to how they overcame this issue when they
> were starting out, did you establish a relationship with each website
> owner/master to allow unhindered access?
> 4. Any other tips and suggestions would also be greatly appreciated.

Sorry if I am misreading the above, but what you are asking for smells
like trying to circumvent reasonable requirements. Yes, do try talking
to website administrators. You might find them to be surprisingly
accommodating if you are reasonable in return.

Regards,
Gora
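
P.S. For anyone unsure what "following the crawl-delay directive" amounts to in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. It is only an illustration, not Nutch's own code: the robots.txt content, the agent name `example-crawler`, the example.org URLs, and the helper `check` are all assumptions made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, inlined so the sketch runs without network access.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

# Hypothetical agent name; a real crawler should identify itself properly.
AGENT = "example-crawler"


def check(robots_txt, agent, url):
    """Return (allowed, delay) for `url` under the given robots.txt text.

    `allowed` says whether the agent may fetch the URL; `delay` is the
    Crawl-delay in seconds that applies to the agent, or None if unset.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url), parser.crawl_delay(agent)


if __name__ == "__main__":
    for url in ("https://example.org/index.html",
                "https://example.org/private/report.html"):
        allowed, delay = check(ROBOTS_TXT, AGENT, url)
        print(url, "allowed" if allowed else "disallowed", "delay:", delay)
```

A crawler that respects these answers skips the disallowed URL and waits at least the stated delay between requests to the same host. That is the minimum; it does not replace talking to the administrator.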

