Hi, I want to crawl a website that denies access to all crawlers. The site is a news site and one of the top sites in the Alexa rankings. Below are my logs from Hadoop. I have set "Protocol.CHECK_ROBOTS" to false in my nutch-site file.
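For context, a minimal sketch of what such an override might look like in conf/nutch-site.xml, assuming an older Nutch 1.x release where the protocol.plugin.check.robots property (the value behind the Protocol.CHECK_ROBOTS constant) was still honored:

    <?xml version="1.0"?>
    <configuration>
      <!-- Hypothetical override; property name assumed from older Nutch 1.x.
           Later Nutch releases removed this switch and always enforce robots.txt. -->
      <property>
        <name>protocol.plugin.check.robots</name>
        <value>false</value>
      </property>
    </configuration>

Note this is only a sketch of the configuration being described, not a confirmed fix; in more recent Nutch versions the robots.txt check cannot be disabled at all.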
How can I solve this problem and crawl this site with Nutch?

2012-05-14 12:39:51,079 INFO org.apache.nutch.fetcher.Fetcher: fetching http://farsnews.com/
2012-05-14 12:39:56,615 INFO org.apache.nutch.protocol.http.api.RobotRulesParser: Couldn't get robots.txt for http://farsnews.com/: java.net.SocketTimeoutException: Read timed out
2012-05-14 12:40:01,873 INFO org.apache.nutch.fetcher.Fetcher: fetch of http://farsnews.com/ failed with: java.net.SocketTimeoutException: Read timed out

Thanks,
kh3rad

--
View this message in context: http://lucene.472066.n3.nabble.com/Couldn-t-get-robots-txt-for-site-tp3983633.html
Sent from the Nutch - User mailing list archive at Nabble.com.

