RE: What could be blocking me, if not robots.txt?

2011-10-03 Thread Chip Calhoun
I apologize, but I haven't found much Nutch documentation that deals with the user-agent and robots.txt. Why am I being blocked when the user-agent I'm sending matches the user-agent in that robots.txt? Chip -Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io]

Re: What could be blocking me, if not robots.txt?

2011-10-03 Thread Markus Jelsma
Oh, I misread. Your user agent is PHFAWS/Nutch-1.3? Are you sure that's what is configured as your user agent name? If your name is PHFAWS, then the robots.txt must list your name without /Nutch-1.3. Or maybe change the robots.txt to:
User-agent: PHFAWS/Nutch-1.3
Allow: /
On Monday 03

RE: What could be blocking me, if not robots.txt?

2011-10-03 Thread Chip Calhoun
Aha! That's done it. Thanks! Incidentally, I only asked them to add the /Nutch-1.3 because originally I had a user-agent of "PHFAWS Spider" and had them add "PHFAWS Spider" to their user-agent, and it didn't work. It seems that at least some sites have trouble with a user-agent that's more than

Re: What could be blocking me, if not robots.txt?

2011-10-03 Thread Julien Nioche
Hi, you can test the robots parsing with:
./nutch plugin lib-http org.apache.nutch.protocol.http.api.RobotRulesParser ~/testRobots.txt ~/testURL PHFAWS/Nutch-1.3
where testRobots.txt contains the robots.txt file that you want to test, testURL has the URLs, and the final argument is your user agent. HTH
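For a quick cross-check outside Nutch, the name-matching question in this thread can be sketched with Python's standard-library robots.txt parser. This is only an illustration under stated assumptions: the robots.txt bodies and the example.com URL are hypothetical (the site's real file is not shown in the thread), and Python's parser is not Nutch's RobotRulesParser, so Nutch may match agent names differently.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical robots.txt bodies; the real file from the thread is not shown.
VERSIONED = """\
User-agent: PHFAWS/Nutch-1.3
Allow: /

User-agent: *
Disallow: /
"""
# Same file, but the record names the crawler without the version suffix.
NAME_ONLY = VERSIONED.replace("PHFAWS/Nutch-1.3", "PHFAWS")

AGENT = "PHFAWS/Nutch-1.3"
URL = "http://example.com/page.html"  # placeholder URL, an assumption

print("versioned record:", allowed(VERSIONED, AGENT, URL))
print("name-only record:", allowed(NAME_ONLY, AGENT, URL))
```

Python's parser compares only the part of the agent string before the first "/", so here the versioned record does not match and the catch-all Disallow applies (False), while the name-only record matches (True). That is one way a robots.txt entry carrying a version suffix can fail to apply to a crawler.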

Nutch not crawling URLs with Spanish accented characters (ñ)

2011-10-03 Thread Ramanathapuram, Rajesh
Hi, I am trying to crawl a website which has link(s) with Spanish/Latin characters in the URL filename. I can't get Nutch to crawl the page(s) with Spanish accented characters in the URL. Link: http://mydomain.com/en Español.aspx http://mydomain.com/en%20Español.aspx or

Re: Nutch not crawling URLs with Spanish accented characters (ñ)

2011-10-03 Thread Markus Jelsma
Looks like you're using protocol-httpclient; try again with the protocol-http plugin instead. We crawled a large part of Wikipedia for test purposes and all global modern character sets worked just fine. Can you fetch http://es.wikipedia.org/wiki/Espa%C3%B1olas with the parse or index checker? It
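To see how a URL such as the Wikipedia one above gets its %C3%B1, here is a minimal sketch of UTF-8 percent-encoding using Python's standard library. The helper name encode_path is made up for illustration, and this says nothing about how Nutch or either protocol plugin normalizes URLs internally.

```python
from urllib.parse import quote, unquote


def encode_path(path: str) -> str:
    """Percent-encode a URL path as UTF-8, leaving "/" separators intact."""
    return quote(path, safe="/")


# "ñ" is U+00F1, which is the two bytes C3 B1 in UTF-8.
print(encode_path("/wiki/Españolas"))   # /wiki/Espa%C3%B1olas
print(encode_path("/en Español.aspx"))  # /en%20Espa%C3%B1ol.aspx

# Decoding reverses the mapping.
print(unquote("/wiki/Espa%C3%B1olas"))  # /wiki/Españolas
```

A link written as en%20Español.aspx is only partly encoded: the space is escaped but the ñ is still raw, so a client has to decide which character encoding to apply to it.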

Re: Nutch not crawling URLs with Spanish accented characters (ñ)

2011-10-03 Thread Ramanathapuram, Rajesh
Thanks Markus, I'll try it and let you know in the morning. Rajesh Ramana On Oct 3, 2011, at 5:52 PM, Markus Jelsma markus.jel...@openindex.io wrote: Looks like you're using protocol-httpclient, try again with the protocol-http plugin instead. We crawler a large part of wikipedia for