Hi Kh3rad,

If a site disallows crawling via robots.txt, then it is CRITICAL that you honor such a directive.
The only time you should ignore robots.txt is if you have explicit permission from the site owner to do so. And even then, it's better if they edit robots.txt to explicitly allow your user agent.

Having said that, farsnews.com has no robots.txt file. It appears they are explicitly checking for user agent strings that are not regular web browsers and intentionally causing those requests to time out (which is what you see below). The same thing happens if you try to use curl to access that top page.

-- Ken

On May 14, 2012, at 2:46am, kh3rad wrote:

> Hi,
>
> I want to crawl a website which denies access to all crawlers. This site is
> one of the top sites in the Alexa rank, and it is a news site. These are my
> logs on Hadoop. I set "Protocol.CHECK_ROBOTS" to false in my nutch-site file.
>
> How can I solve this problem and crawl this site with Nutch?
>
> 2012-05-14 12:39:51,079 INFO org.apache.nutch.fetcher.Fetcher: fetching
> http://farsnews.com/
> 2012-05-14 12:39:56,615 INFO
> org.apache.nutch.protocol.http.api.RobotRulesParser: Couldn't get robots.txt
> for http://farsnews.com/: java.net.SocketTimeoutException: Read timed out
> 2012-05-14 12:40:01,873 INFO org.apache.nutch.fetcher.Fetcher: fetch of
> http://farsnews.com/ failed with: java.net.SocketTimeoutException: Read
> timed out
>
> Thanks,
> kh3rad
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Couldn-t-get-robots-txt-for-site-tp3983633.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr
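PS: if the site owner does grant you permission and whitelists your user agent in robots.txt, the Nutch-side step is to identify your crawler honestly via the http.agent.name property, rather than disabling robots handling. A minimal nutch-site.xml sketch (the agent name and URL values here are placeholders, not real ones):

```
<!-- conf/nutch-site.xml: identify your crawler so the site can allow it -->
<property>
  <name>http.agent.name</name>
  <value>MyCrawler</value>
</property>
<property>
  <name>http.agent.url</name>
  <value>http://example.com/crawler-info</value>
</property>
```

The value of http.agent.name is what the site owner would reference in their robots.txt Allow/Disallow rules.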

