Hi,

I want to crawl a website that denies access to all crawlers. It is a news site and one of the top-ranked sites on Alexa. Below are my logs from Hadoop. I have already set "Protocol.CHECK_ROBOTS" to false in my nutch-site.xml file.

How can I solve this problem and crawl this site with Nutch?
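For reference, the change in my nutch-site.xml is roughly this (a sketch only: I am assuming the property key behind the Protocol.CHECK_ROBOTS constant is protocol.plugin.check.robots; please verify the exact name against the nutch-default.xml shipped with your Nutch version):

```xml
<!-- nutch-site.xml (sketch): disable robots.txt checking.
     Assumption: the key behind Protocol.CHECK_ROBOTS is
     "protocol.plugin.check.robots" -- check nutch-default.xml
     for your version before relying on this. -->
<property>
  <name>protocol.plugin.check.robots</name>
  <value>false</value>
</property>
```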


2012-05-14 12:39:51,079 INFO org.apache.nutch.fetcher.Fetcher: fetching http://farsnews.com/
2012-05-14 12:39:56,615 INFO org.apache.nutch.protocol.http.api.RobotRulesParser: Couldn't get robots.txt for http://farsnews.com/: java.net.SocketTimeoutException: Read timed out
2012-05-14 12:40:01,873 INFO org.apache.nutch.fetcher.Fetcher: fetch of http://farsnews.com/ failed with: java.net.SocketTimeoutException: Read timed out

Thanks, 
kh3rad

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Couldn-t-get-robots-txt-for-site-tp3983633.html
Sent from the Nutch - User mailing list archive at Nabble.com.