Hello Folks,

I have noticed that Nutch resources and mailing lists are mostly geared
towards the use of Nutch in research-oriented projects. I would like to
hear from those of you who are using Nutch in production for large-scale
crawling (vertical or non-vertical) about what challenges to expect and how
to overcome them.

I will list a few challenges that I have faced below. If you have faced
them too, I would like to hear how you overcame them.


   1. If I build a vertical search engine for websites in a particular
   domain and follow the Crawl-delay directive in robots.txt for politeness,
   a webmaster could still block my IP address, and I would start getting
   HTTP 403 Forbidden (access denied) responses. How can I overcome this
   kind of issue, other than by providing full contact info in
   nutch-site.xml so that the webmaster can get in touch with me before
   blocking me?
   2. Being seen by webmasters as just another Nutch variant puts you at a
   great disadvantage: you could be blocked from accessing a website at the
   whim of its webmaster.
   3. Can anyone share how they overcame this issue when they were starting
   out? Did you establish a relationship with each website owner or
   webmaster to allow unhindered access?
   4. Any other tips and suggestions would also be greatly appreciated.
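
For context, the contact info I mention in point 1 is set through the
http.agent.* properties in nutch-site.xml. Here is a minimal sketch of the
kind of setup I mean. The crawler name, URL, and email address are made-up
placeholders, and the delay value is only an example, not a recommendation:

```xml
<!-- nutch-site.xml (sketch): every value below is a placeholder -->
<configuration>
  <!-- Name sent in the User-Agent header; hypothetical crawler name -->
  <property>
    <name>http.agent.name</name>
    <value>ExampleVerticalBot</value>
  </property>
  <!-- Short description of what the crawler is for -->
  <property>
    <name>http.agent.description</name>
    <value>Crawler for a vertical search engine (hypothetical)</value>
  </property>
  <!-- Page where a webmaster can read about the crawler; placeholder URL -->
  <property>
    <name>http.agent.url</name>
    <value>http://www.example.com/bot.html</value>
  </property>
  <!-- Contact address for complaints; placeholder address -->
  <property>
    <name>http.agent.email</name>
    <value>crawler-admin@example.com</value>
  </property>
  <!-- Seconds to wait between requests to the same server. As far as I
       understand, a Crawl-delay in robots.txt is honored when present. -->
  <property>
    <name>fetcher.server.delay</name>
    <value>5.0</value>
  </property>
</configuration>
```

Even with something like this in place, a webmaster can still block the
crawler without contacting me first, which is why I am asking what else
people do.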


Thanks.
