Hi Kh3rad,

If a site disallows crawling via robots.txt, then it is CRITICAL that you honor 
such a directive.

The only time you should ignore robots.txt is if you have explicit permission 
from the site owner to do so.

And even then, it's better if they edit robots.txt to explicitly allow your 
user agent.

Having said that, farsnews.com has no robots.txt file. It appears they are 
explicitly checking for user agent strings that are not regular web browsers 
and intentionally causing these requests to time out (which is what you see 
below). The same thing happens if you try to use curl to access that top page.
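A quick way to confirm that kind of user-agent filtering, without running Nutch at all, is to request the page twice with different User-Agent headers and compare the results. The sketch below uses Python's standard urllib purely as an illustration (curl -A does the same thing from a shell); the URL and the browser UA string are just examples, and the site's exact behavior may of course differ.

```python
import urllib.request
import urllib.error


def fetch_status(url, user_agent=None, timeout=10):
    """Return the HTTP status code for `url`, or a repr of the error on failure."""
    req = urllib.request.Request(url)
    if user_agent:
        req.add_header("User-Agent", user_agent)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code   # the server answered, just with an error status
    except Exception as exc:
        return repr(exc)  # timeouts / resets, like the one in the Nutch log


# Example (network required; output depends on the site's current behavior):
#   fetch_status("http://farsnews.com/")
#   fetch_status("http://farsnews.com/", "Mozilla/5.0 (X11; Linux x86_64)")
```

If the default (non-browser) request times out or is rejected while the browser-UA request succeeds, the server is discriminating on the User-Agent header, and no Nutch setting will change that.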

-- Ken

On May 14, 2012, at 2:46am, kh3rad wrote:

> Hi,
> 
> I want to crawl a website that denies access to all crawlers. The site is
> one of the top sites in the Alexa rankings, and it is a news site. These are
> my logs on Hadoop. I set "Protocol.CHECK_ROBOTS" to false in my nutch-site file.
> 
> how can i solve this problem and crawl this site with nutch?
> 
> 
> 2012-05-14 12:39:51,079 INFO org.apache.nutch.fetcher.Fetcher: fetching
> http://farsnews.com/
> 2012-05-14 12:39:56,615 INFO
> org.apache.nutch.protocol.http.api.RobotRulesParser: Couldn't get robots.txt
> for http://farsnews.com/: java.net.SocketTimeoutException: Read timed out
> 2012-05-14 12:40:01,873 INFO org.apache.nutch.fetcher.Fetcher: fetch of
> http://farsnews.com/ failed with: java.net.SocketTimeoutException: Read
> timed out
> 
> Thanks, 
> kh3rad
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Couldn-t-get-robots-txt-for-site-tp3983633.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr



