Gora,

Thanks for sharing your admin perspective. Rest assured, I am not trying
to circumvent any politeness requirements in any way. As I mentioned
earlier, I am within the crawl-delay limits set by the webmasters, if
any. However, you have confirmed my hunch that I might have to reach out
to individual webmasters to try to convince them not to block my IP
address.
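
To be concrete about what I mean by staying within the crawl-delay limits,
here is a minimal sketch (not what our crawler actually runs) that reads
the Crawl-delay value using Python's standard urllib.robotparser. The agent
name MyNutchCrawler, the sample robots.txt and the 5-second fallback are
all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: MyNutchCrawler
Crawl-delay: 10
Disallow: /private/

User-agent: *
Crawl-delay: 30
"""

# Assumed fallback (seconds) when a site sets no Crawl-delay at all.
DEFAULT_DELAY = 5.0


def polite_delay(robots_txt: str, agent: str, default: float = DEFAULT_DELAY) -> float:
    """Return the number of seconds to wait between requests for this agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(agent)  # None if no matching Crawl-delay
    return float(delay) if delay is not None else default


if __name__ == "__main__":
    print(polite_delay(ROBOTS_TXT, "MyNutchCrawler"))
    print(polite_delay(ROBOTS_TXT, "SomeOtherBot"))
```

Honouring this value only covers the politeness side; it does not stop a
webmaster from blocking the IP address anyway.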

Even with as few as 100 websites to crawl, it would be a huge challenge
for us to communicate with each and every webmaster. How would one go
about doing that? Also, is there a standard way in which webmasters list
their contact info, so that we can pitch to them and persuade them to
allow us to crawl their websites at a reasonable frequency?

By being at a disadvantage, I meant at a disadvantage compared to major
players such as the Google, Bing and Yahoo bots, which webmasters would
probably not block. By Nutch variant, I meant an instance of a customized
crawler based on Nutch.

Thanks.


On Sun, Jun 22, 2014 at 1:33 PM, Gora Mohanty <[email protected]> wrote:

> On 22 June 2014 22:07, Meraj A. Khan <[email protected]> wrote:
> >
> > Hello Folks,
> >
> > I have noticed that Nutch resources and mailing lists are mostly geared
> > towards the usage of Nutch in research-oriented projects. I would like
> > to know from those of you who are using Nutch in production for
> > large-scale crawling (vertical or non-vertical) what challenges to
> > expect and how to overcome them.
> >
> > I will list a few challenges that I faced below, and would like to hear
> > from you, if you faced these challenges, how you overcame them.
> >
> >
> >    1. If I were to build a vertical search engine for websites in a
> >    particular domain and follow the crawl-delay directive in robots.txt
> >    for politeness, the web master could still block my IP address, and
> >    I would start getting HTTP 403 Forbidden/access denied responses.
> >    How can I overcome this kind of issue, other than by providing full
> >    contact info in nutch-site.xml so that the web master can get in
> >    touch with me before blocking me?
>
> Er, providing full access info. is just basic politeness, and IMHO
> should become a requirement for Nutch. If you are going to hit some
> sites particularly hard, with good reasons, try contacting the website
> administrators and explaining to them why you need such access. We
> both administer, and crawl sites, and as an administrator I am quite
> willing to accept reasonable requests. After all, it is also our goal
> to promote our websites, and already most traffic on the web is
> through search engines.
>
> >    2. The fact that you will be considered just another Nutch variant
> >    by the web master puts you at a great disadvantage: you could be
> >    blocked from accessing the website at the whims of the web master.
>
> Not sure what you mean by "just another Nutch variant", nor why you
> think that it puts you at a disadvantage. Disadvantage compared to
> whom? Also, "whims of the web master"? Really? After all, it is their
> resources that you are using, and they are perfectly within their
> rights to ban you if they feel, for whatever reason, that you are
> abusing such resources.
>
> >    3. Can anyone share how they overcame this issue when they were
> >    starting out? Did you establish a relationship with each website
> >    owner/web master to allow unhindered access?
> >    4. Any other tips and suggestions would also be greatly appreciated.
>
> Sorry if I am misreading the above, but what you are asking for smells
> like trying to circumvent reasonable requirements. Yes, do try talking
> to website administrators. You might find them to be surprisingly
> accommodating if you are reasonable in return.
>
> Regards,
> Gora
>
