Why are you assuming that the webmasters are actually going to block you?
In my experience, this is the least likely scenario.

On Jun 22, 2014, at 4:14 PM, Meraj A. Khan <[email protected]> wrote:

> Gora,
> 
> Thanks for sharing your admin perspective. Rest assured, I am not trying
> to circumvent any politeness requirements in any way. As I mentioned
> earlier, I am within the crawl-delay limits set by the webmasters, if
> any. However, you have confirmed my hunch that I might have to reach out
> to individual webmasters to try to convince them not to block my IP
> address.
> 
> Even if I have as few as 100 websites to crawl, it would be a huge
> challenge for us to communicate with each and every webmaster. How would
> one go about doing that? Also, is there a standard way that webmasters
> list their contact info, so that we can pitch to them or persuade them to
> allow us to crawl their websites at a reasonable frequency?
> 
> By being at a disadvantage, I meant at a disadvantage compared to major
> players like the Google, Bing, and Yahoo bots, which the webmasters
> probably would not block, and by Nutch variant, I meant an instance of a
> customized crawler based on Nutch.
> 
> Thanks.
> 
> 
> On Sun, Jun 22, 2014 at 1:33 PM, Gora Mohanty <[email protected]> wrote:
> 
>> On 22 June 2014 22:07, Meraj A. Khan <[email protected]> wrote:
>>> 
>>> Hello Folks,
>>> 
>>> I have noticed that Nutch resources and mailing lists are mostly geared
>>> towards the usage of Nutch in research-oriented projects. I would like to
>>> know from those of you who are using Nutch in production for large-scale
>>> crawling (vertical or non-vertical) what challenges to expect and how
>>> to overcome them.
>>> 
>>> I will list a few challenges that I faced below and, if you faced them
>>> too, would like to hear from you on how you overcame them.
>>> 
>>> 
>>>  1. If I were to go for a vertical search engine for websites in a
>>>  particular domain and follow the crawl-delay directive for politeness in
>>>  the robots.txt, there is a possibility that the webmaster could still
>>>  block my IP address, and I would start getting HTTP 403 Forbidden/access
>>>  denied messages. How can I overcome these kinds of issues, other than
>>>  by providing full contact info in nutch-site.xml so that the webmaster
>>>  can get in touch with me before blocking me?
>> 
>> Er, providing full contact info is just basic politeness, and IMHO
>> should become a requirement for Nutch. If you are going to hit some
>> sites particularly hard, with good reasons, try contacting the website
>> administrators and explaining to them why you need such access. We
>> both administer, and crawl sites, and as an administrator I am quite
>> willing to accept reasonable requests. After all, it is also our goal
>> to promote our websites, and already most traffic on the web is
>> through search engines.
>> 
>>>  2. The fact that you will be considered as just another Nutch variant by
>>>  the webmaster puts you at a great disadvantage, where you could be
>>>  blocked from accessing the website at the whims of the webmaster.
>> 
>> Not sure what you mean by "just another Nutch variant", nor why you
>> think that it puts you at a disadvantage. Disadvantage compared to
>> whom? Also, "whims of the web master"? Really? After all, it is their
>> resources that you are using, and they are perfectly within their
>> rights to ban you if they feel, for whatever reason, that you are
>> abusing such resources.
>> 
>>>  3. Can anyone share info as to how they overcame this issue when they
>>>  were starting out? Did you establish a relationship with each website
>>>  owner/master to allow unhindered access?
>>>  4. Any other tips and suggestions would also be greatly appreciated.
>> 
>> Sorry if I am misreading the above, but what you are asking for smells
>> like trying to circumvent reasonable requirements. Yes, do try talking
>> to website administrators. You might find them to be surprisingly
>> accommodating if you are reasonable in return.
>> 
>> Regards,
>> Gora
>> 
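
For reference, the crawl-delay check discussed in the thread can be sketched
in Python with the standard-library urllib.robotparser. This is only an
illustration and not how Nutch itself implements politeness: the robots.txt
content, the agent name "my-crawler", the example.com URLs, and the helper
politeness_rules are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""


def politeness_rules(robots_txt, agent, url, default_delay=1.0):
    """Return (allowed, delay_seconds) for `agent` fetching `url`.

    `default_delay` is an assumed fallback used when the site sets no
    Crawl-delay for this agent.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(agent)  # None if no Crawl-delay applies
    allowed = parser.can_fetch(agent, url)
    return allowed, float(delay) if delay is not None else default_delay


if __name__ == "__main__":
    for url in ("http://example.com/index.html", "http://example.com/private/a"):
        allowed, delay = politeness_rules(ROBOTS_TXT, "my-crawler", url)
        print(url, "allowed" if allowed else "disallowed", "delay:", delay)
```

A crawler would sleep for at least the returned delay between requests to the
same host and skip any URL reported as disallowed. Staying within these rules
does not prevent a 403; it only makes a block less likely.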

