Why are you assuming that the webmasters are actually going to block you? In my experience this is the least likely scenario.
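
As an aside, the crawl-delay check mentioned in the quoted message below can be sketched with Python's standard urllib.robotparser module. This is only a minimal sketch and not how Nutch implements it: the robots.txt content, the agent name example-nutch-crawler, and the 5-second fallback delay are all made-up values for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs without network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

# Hypothetical crawler name; a real crawler would use its configured agent name.
USER_AGENT = "example-nutch-crawler"


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def fetch_plan(parser, url, user_agent=USER_AGENT, default_delay=5.0):
    """Return (allowed, delay_seconds) for one URL under the parsed rules.

    default_delay is an assumed fallback used when the site sets no Crawl-delay.
    """
    allowed = parser.can_fetch(user_agent, url)
    delay = parser.crawl_delay(user_agent)
    return allowed, float(delay) if delay is not None else default_delay


rules = load_rules(SAMPLE_ROBOTS_TXT)
print(fetch_plan(rules, "http://example.com/index.html"))
print(fetch_plan(rules, "http://example.com/private/report.html"))
```

Staying within these limits does not guarantee that a site will not block you, which is the point of the discussion below; it only shows that the crawler is honouring what the site has published.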

On Jun 22, 2014, at 4:14 PM, Meraj A. Khan <[email protected]> wrote:

> Gora,
>
> Thanks for sharing your admin perspective. Rest assured, I am not trying
> to circumvent any politeness requirements in any way. As I mentioned
> earlier, I am within the crawl-delay limits set by the webmasters, if
> any. However, you have confirmed my hunch that I might have to reach out
> to individual webmasters to try to convince them not to block my IP
> address.
>
> Even if I have as few as 100 websites to crawl, it would be a huge
> challenge for us to communicate with each and every webmaster. How would
> one go about doing that? Also, is there a standard way that webmasters
> list their contact info, so that we can pitch to them or persuade them to
> allow us to crawl their websites at a reasonable frequency?
>
> By being at a disadvantage, I meant at a disadvantage compared to major
> players like the Google, Bing and Yahoo bots, which the webmasters
> probably would not block, and by Nutch variant, I meant an instance of a
> customized crawler based on Nutch.
>
> Thanks.
>
>
> On Sun, Jun 22, 2014 at 1:33 PM, Gora Mohanty <[email protected]> wrote:
>
>> On 22 June 2014 22:07, Meraj A. Khan <[email protected]> wrote:
>>>
>>> Hello Folks,
>>>
>>> I have noticed that Nutch resources and mailing lists are mostly geared
>>> towards the usage of Nutch in research-oriented projects. I would like
>>> to know from those of you who are using Nutch in production for
>>> large-scale crawling (vertical or non-vertical) what challenges to
>>> expect and how to overcome them.
>>>
>>> I will list a few challenges that I faced below, and would like to hear
>>> from you, if you faced them too, how you overcame them.
>>>
>>>
>>> 1. If I were to go for a vertical search engine for websites in a
>>>    particular domain and follow the crawl-delay directive for
>>>    politeness in the robots.txt, there is a possibility that the
>>>    webmaster could still block my IP address and I start getting HTTP
>>>    403 forbidden/access denied messages. How can I overcome these kinds
>>>    of issues, other than providing full contact info in the
>>>    nutch-site.xml for the webmaster to get in touch with me before
>>>    blocking me?
>>
>> Er, providing full access info. is just basic politeness, and IMHO
>> should become a requirement for Nutch. If you are going to hit some
>> sites particularly hard, with good reasons, try contacting the website
>> administrators and explaining to them why you need such access. We
>> both administer and crawl sites, and as an administrator I am quite
>> willing to accept reasonable requests. After all, it is also our goal
>> to promote our websites, and already most traffic on the web is
>> through search engines.
>>
>>> 2. The fact that you will be considered as just another Nutch variant
>>>    by the webmaster puts you at a great disadvantage, where you could
>>>    be blocked from accessing the website at the whims of the webmaster.
>>
>> Not sure what you mean by "just another Nutch variant", nor why you
>> think that it puts you at a disadvantage. Disadvantage compared to
>> whom? Also, "whims of the web master"? Really? After all, it is their
>> resources that you are using, and they are perfectly within their
>> rights to ban you if they feel, for whatever reason, that you are
>> abusing such resources.
>>
>>> 3. Can anyone share info as to how they overcame this issue when they
>>>    were starting out? Did you establish a relationship with each
>>>    website owner/master to allow unhindered access?
>>> 4. Any other tips and suggestions would also be greatly appreciated.
>>
>> Sorry if I am misreading the above, but what you are asking for smells
>> like trying to circumvent reasonable requirements. Yes, do try talking
>> to website administrators. You might find them to be surprisingly
>> accommodating if you are reasonable in return.
>>
>> Regards,
>> Gora

