Oops! Forgot to mention, I am using Nutch 1.2. thanks & regards, Rajesh Ramana
-----Original Message-----
Sent: Monday, October 03, 2011 5:27 PM
To: [email protected]
Subject: Nutch not crawling URLs with spanish accented characters (ñ)

Hi,

I am trying to crawl a website that has links with Spanish/Latin characters in the URL filename. I can't get Nutch to crawl pages whose URLs contain Spanish accented characters.

Link: http://mydomain.com/en Español.aspx or http://mydomain.com/en%20Español.aspx

I tried substituting the URL-encoded form (%F1) for the special character (ñ), and %20 for the space (" "); the whole list is here <http://www.w3schools.com/TAGS/ref_urlencode.asp>. The link http://mydomain.com/en%20Espa%F1ol.aspx works fine in the browser.

I tried to use the regex URL normalizer to do the substitution in the regex-normalize.xml file, as below (%20 for " " and %F1 for the special character ñ).

  <!-- replaces blank space(" ") in URL with escaped "%20" -->
  <regex>
    <pattern> </pattern>
    <substitution>%20</substitution>
  </regex>

  <!-- replaces accented char("ñ") in URL with escaped "%F1" -->
  <regex>
    <pattern>ñ</pattern>
    <substitution>%F1</substitution>
  </regex>

The former (blank space) substitution works fine, but I am having trouble with the latter (ñ) substitution. I get a FATAL error (pointing to the location of ñ in the file) in the command prompt and the error below in my hadoop log.

  ERROR regex.RegexURLNormalizer - error parsing conf file: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
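
The parser error above is consistent with the byte for ñ not being valid UTF-8. The following is a minimal Java sketch, assuming (this is not confirmed anywhere in this message) that regex-normalize.xml was saved in ISO-8859-1 or Windows-1252 while the XML parser read it as UTF-8. In ISO-8859-1, ñ is the single byte 0xF1, which UTF-8 treats as the lead byte of a 4-byte sequence. The class and method names are made up for illustration and are not part of Nutch.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only (not part of Nutch).
public class AccentEncodingDemo {

    // Percent-encode a string using the named charset.
    static String percentEncode(String s, String charset) {
        try {
            return URLEncoder.encode(s, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    // Render a byte array as upper-case hex digits.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String n = "\u00F1"; // the character ñ, written as an escape

        // One byte in ISO-8859-1, two bytes in UTF-8.
        System.out.println(hex(n.getBytes(StandardCharsets.ISO_8859_1))); // F1
        System.out.println(hex(n.getBytes(StandardCharsets.UTF_8)));      // C3B1

        // The percent-encoded form depends on the charset chosen.
        System.out.println(percentEncode(n, "ISO-8859-1")); // %F1
        System.out.println(percentEncode(n, "UTF-8"));      // %C3%B1
    }
}
```

If that assumption holds, either saving the file as UTF-8 or declaring its actual encoding in the XML prolog may avoid the parse error. Note also that %F1 is the ISO-8859-1 form of ñ; a server expecting UTF-8 URLs would want %C3%B1 instead.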
Then I tried changing the character encoding in the nutch-site.xml file:

  <property>
    <name>parser.character.encoding.default</name>
    <value>ISO-8859-1</value>
    <description>The character encoding to fall back to when no other information is available</description>
  </property>

And in the regex-normalize.xml file as below:

  <regex>
    <pattern>U+00F1</pattern>
    <substitution>%F1</substitution>
  </regex>

Now I don't get any error in the command prompt, but I do get the error below in my hadoop log. It looks like a substitution is happening, but instead of "%F1" it uses "?".

  ERROR httpclient.Http - java.lang.IllegalArgumentException: Invalid uri 'http://mydomain.com/en%20Espa?ol.aspx': escaped absolute path not valid
  2011-10-03 16:44:02,123 ERROR httpclient.Http - at org.apache.commons.httpclient.HttpMethodBase.<init>(HttpMethodBase.java:222)
  2011-10-03 16:44:02,123 ERROR httpclient.Http - at org.apache.commons.httpclient.methods.GetMethod.<init>(GetMethod.java:89)
  2011-10-03 16:44:02,123 ERROR httpclient.Http - at org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:70)
  2011-10-03 16:44:02,123 ERROR httpclient.Http - at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:154)
  2011-10-03 16:44:02,123 ERROR httpclient.Http - at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:224)
  2011-10-03 16:44:02,123 ERROR httpclient.Http - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:628)
  2011-10-03 16:44:02,126 INFO fetcher.Fetcher - fetch of http://mydomain.com/en%20Espa?ol.aspx failed with: java.lang.IllegalArgumentException: Invalid uri 'http://mydomain.com/en%20Espa?ol.aspx': escaped absolute path not valid.

Can anyone help me with this issue? Are there any other config changes I need to make to get this to work?

Thanks in advance; any help in resolving this issue is much appreciated.

thanks & regards,
Rajesh Ramana
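
The behaviour described above can be reproduced with a small Java sketch, under two assumptions not confirmed in this thread: that the normalizer hands the pattern to java.util.regex, and that the "?" comes from the URL later being converted to an ASCII-only charset. Read as a regex, U+00F1 means "one or more U, then 00F1", so it never matches ñ. The regex escape \u00F1 does match it. The class and method names are hypothetical and are not Nutch code.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only (not part of Nutch).
public class PatternDemo {

    // Apply one pattern/substitution pair to a URL, as a regex normalizer might.
    static String normalize(String url, String pattern, String substitution) {
        return Pattern.compile(pattern).matcher(url).replaceAll(substitution);
    }

    // Round-trip a string through US-ASCII; unmappable characters become '?'.
    static String toAscii(String s) {
        byte[] ascii = s.getBytes(StandardCharsets.US_ASCII);
        return new String(ascii, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String url = "http://mydomain.com/en%20Espa\u00F1ol.aspx";

        // "U+00F1" is read as a literal regex, so the URL is left unchanged.
        System.out.println(normalize(url, "U+00F1", "%F1").equals(url)); // true

        // The regex escape \u00F1 (backslash, u, four hex digits) matches ñ.
        System.out.println(normalize(url, "\\u00F1", "%F1"));
        // http://mydomain.com/en%20Espa%F1ol.aspx

        // An un-normalized ñ pushed through an ASCII conversion turns into '?'.
        System.out.println(toAscii(url));
        // http://mydomain.com/en%20Espa?ol.aspx
    }
}
```

If the first assumption holds, writing the pattern as \u00F1 in regex-normalize.xml might avoid both the XML encoding problem and the failed match, since the file would then contain only ASCII characters.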

