Hello. I think I had the same problem a few weeks ago. Try resolving it by adding the following properties to *nutch-site.xml* (inside its `<configuration>` element):
    <property>
      <name>http.agent.name</name>
      <value>YourCrawlerName</value>
      <description>HTTP 'User-Agent' request header. MUST NOT be empty -
      please set this to a single word uniquely related to your organization.
      </description>
    </property>

    <property>
      <name>http.robots.agents</name>
      <value>YourCrawlerName,*</value>
      <description>The agent strings we'll look for in robots.txt files,
      comma-separated, in decreasing order of precedence. You should put the
      value of http.agent.name as the first agent name, and keep the default
      * at the end of the list. E.g.: BlurflDev,Blurfl,*
      </description>
    </property>

It's mandatory to include the value of *http.agent.name* as the first entry in the *http.robots.agents* property. Good luck!

