I apologize, but I haven't found much Nutch documentation that deals with the
user-agent and robots.txt. Why am I being blocked when the user-agent I'm
sending matches the user-agent in that robots.txt?
Chip
-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Oh, I misread. Your user agent is PHFAWS/Nutch-1.3? Are you sure that's
what is configured as your user agent name? If your name is PHFAWS, then the
robots.txt must list your name without /Nutch-1.3.
Or maybe change the robots.txt to
User-agent: PHFAWS/Nutch-1.3
Allow: /
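For illustration only, here is a minimal sketch of the bare-name case using Python's standard-library urllib.robotparser. This is not Nutch's RobotRulesParser, and the two may treat agent names differently. The robots.txt content, the example.com URL, and the is_allowed helper are hypothetical, not taken from the site discussed here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: allow the PHFAWS crawler, block everyone else.
ROBOTS_TXT = """\
User-agent: PHFAWS
Allow: /

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "http://example.com/index.html"  # assumed test URL
    for agent in ("PHFAWS/Nutch-1.3", "PHFAWS", "SomeOtherBot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, url))
```

This particular parser compares only the part of the agent string before the "/", which is one reason a robots.txt that lists the bare name tends to be the safer choice.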
On Monday 03
Aha! That's done it. Thanks!
Incidentally, I only asked them to add the /Nutch-1.3 because originally I had
a user-agent of PHFAWS Spider and had them add PHFAWS Spider to their
user-agent, and it didn't work. It seems that at least some sites have trouble
with a user-agent that's more than
Hi,
You can test the robots parsing with:
./nutch plugin lib-http org.apache.nutch.protocol.http.api.RobotRulesParser ~/testRobots.txt ~/testURL PHFAWS/Nutch-1.3
where testRobots.txt contains the robots.txt file that you want to test,
testURL lists the URLs to check, and the last argument is your user agent.
HTH
Hi,
I am trying to crawl a website that has links with Spanish/Latin characters
in the URL filename. I can't get Nutch to crawl pages with Spanish
accented characters in the URL.
Link: http://mydomain.com/en Español.aspx
http://mydomain.com/en%20Español.aspx or
Looks like you're using protocol-httpclient. Try again with the protocol-http
plugin instead. We crawled a large part of Wikipedia for test purposes and all
global modern character sets worked just fine.
Can you fetch:
http://es.wikipedia.org/wiki/Espa%C3%B1olas
with parse or index checker? It
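As a side note on the URL forms in this thread, the following is a small sketch of how a path with a space and an accented character is percent-encoded as UTF-8. It uses Python's urllib.parse and says nothing about how Nutch or either protocol plugin normalizes URLs. The encode_path helper is a hypothetical name.

```python
from urllib.parse import quote, unquote


def encode_path(path):
    """Percent-encode a URL path as UTF-8, leaving '/' separators intact."""
    return quote(path, safe="/")


if __name__ == "__main__":
    # The space and the "ñ" are escaped; "ñ" becomes two UTF-8 bytes.
    print(encode_path("/en Español.aspx"))
    # Decoding the Wikipedia test path recovers the accented character.
    print(unquote("/wiki/Espa%C3%B1olas"))
```

A link that escapes only the space and leaves the "ñ" raw is partially encoded, so it may be worth testing the fully encoded form as well.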
Thanks Markus, I'll try it and let you know in the morning.
Rajesh Ramana
On Oct 3, 2011, at 5:52 PM, Markus Jelsma markus.jel...@openindex.io wrote:
Looks like you're using protocol-httpclient, try again with the
protocol-http plugin instead. We crawled a large part of Wikipedia for