Thanks Markus, I'll try it and let you know in the morning.
Rajesh Ramana

On Oct 3, 2011, at 5:52 PM, "Markus Jelsma" <[email protected]> wrote:

>> Looks like you're using protocol-httpclient, try again with the
>> protocol-http plugin instead. We crawled a large part of wikipedia for
>> test purposes and all global modern character sets worked just fine.
>>
>> Can you fetch:
>> http://es.wikipedia.org/wiki/Espa%C3%B1olas
>>
>> with parse or index checker? It works fine here.
>
> try bin/nutch org.apache.nutch.parse.ParserChecker <URL>
> with both protocol-httpclient and protocol-http.
>
>>
>>> I am trying to crawl a website which has link(s) with spanish/latin
>>> characters in the url filename. I can't get Nutch to crawl the page(s)
>>> with spanish accented chars in URL.
>>>
>>> Link: http://mydomain.com/en Español.aspx
>>> <http://mydomain.com/en%20Español.aspx> or
>>> http://mydomain.com/en%20Español.aspx
>>> <http://mydomain.com/en%20Español.aspx>
>>>
>>> I tried to substitute the URL encode(%F1) for the special character (ñ),
>>> (and %20 is for " "), the whole list here
>>> <http://www.w3schools.com/TAGS/ref_urlencode.asp> .
>>>
>>> The link http://mydomain.com/en%20Espa%F1ol.aspx works fine in the
>>> browser.
>>>
>>> I tried to use regex URL normalizer to do the substitution in
>>> regex-normalize.xml file as below (%20 is for " ") and (%F1 for the
>>> special character ñ).
>>>
>>> <!-- replaces blank space(" ") in URL with escaped "%20" -->
>>> <regex>
>>>   <pattern> </pattern>
>>>   <substitution>%20</substitution>
>>> </regex>
>>>
>>> <!-- replaces accented char("ñ") in URL with escaped "%F1" -->
>>> <regex>
>>>   <pattern>ñ</pattern>
>>>   <substitution>%F1</substitution>
>>> </regex>
>>>
>>> The former (blank space) substitution works fine, but I am having trouble
>>> with the latter (ñ) substitution. I am getting a FATAL error (pointing to
>>> the ñ location in the file) in the command prompt and the below error in
>>> my hadoop log.
>>>
>>> ERROR regex.RegexURLNormalizer - error parsing conf file:
>>> org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2
>>> of 4-byte UTF-8 sequence.
>>>
>>> Then I tried changing the character encoding in nutch-site.xml file
>>>
>>> <property>
>>>   <name>parser.character.encoding.default</name>
>>>   <value>ISO-8859-1</value>
>>>   <description>The character encoding to fall back to when no other
>>>   information is available</description>
>>> </property>
>>>
>>> And in the regex-normalize.xml file as below
>>>
>>> <regex>
>>>   <pattern>U+00F1</pattern>
>>>   <substitution>%F1</substitution>
>>> </regex>
>>>
>>> Now, I don't have any error in the command prompt, but I get the below
>>> error in my hadoop log. It looks like the substitution is happening but
>>> instead of the "%F1" it uses "?".
>>>
>>> ERROR httpclient.Http - java.lang.IllegalArgumentException: Invalid uri
>>> 'http://mydomain.com/en%20Espa?ol.aspx': escaped absolute path not valid
>>> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
>>> org.apache.commons.httpclient.HttpMethodBase.<init>(HttpMethodBase.java:222)
>>> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
>>> org.apache.commons.httpclient.methods.GetMethod.<init>(GetMethod.java:89)
>>> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
>>> org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:70)
>>> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
>>> org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:154)
>>> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
>>> org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:224)
>>> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
>>> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:628)
>>> 2011-10-03 16:44:02,126 INFO fetcher.Fetcher - fetch of
>>> http://mydomain.com/en%20Espa?ol.aspx failed with:
>>> java.lang.IllegalArgumentException: Invalid uri
>>> 'http://mydomain.com/en%20Espa?ol.aspx': escaped absolute path not valid.
>>>
>>> Can anyone help me with this issue? Are there any other config changes I
>>> need to make to get this to work?
>>>
>>> Thanks in advance, any help in resolving this issue is much appreciated.
>>>
>>> thanks & regards,
>>> Rajesh Ramana
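
A side note on the two escapes that appear in this thread: %C3%B1 (in the Wikipedia URL) and %F1 (in the normalizer rule) are both ñ, percent-encoded under UTF-8 and ISO-8859-1 respectively. The Python sketch below only illustrates that difference; it is not Nutch code, and the helper names percent_encode_path and decodes_as_utf8 are made up for the example. It also shows that a lone 0xF1 byte is not valid UTF-8, which would be consistent with the MalformedByteSequenceException above if regex-normalize.xml had been saved as ISO-8859-1 and parsed as UTF-8. That is an assumption; the thread does not say how the file was saved.

```python
from urllib.parse import quote


def percent_encode_path(path: str, charset: str) -> str:
    """Percent-encode a URL path using the bytes of the given charset."""
    return quote(path, safe="/", encoding=charset)


def decodes_as_utf8(raw: bytes) -> bool:
    """Return True if the raw bytes form a valid UTF-8 sequence."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# File name from the thread; \u00f1 is the character ñ.
filename = "en Espa\u00f1ol.aspx"

utf8_form = percent_encode_path(filename, "utf-8")
latin1_form = percent_encode_path(filename, "iso-8859-1")

print(utf8_form)    # en%20Espa%C3%B1ol.aspx
print(latin1_form)  # en%20Espa%F1ol.aspx

# Hypothetical config line, encoded the way a Latin-1 editor would save it.
# In UTF-8, 0xF1 announces a 4-byte sequence, so the "<" that follows is
# an invalid second byte and decoding fails.
latin1_config_bytes = "<pattern>\u00f1</pattern>".encode("iso-8859-1")
utf8_config_bytes = "<pattern>\u00f1</pattern>".encode("utf-8")

print(decodes_as_utf8(latin1_config_bytes))  # False
print(decodes_as_utf8(utf8_config_bytes))    # True
```

Which of the two escaped forms the server accepts depends on the server; the thread only confirms that the %F1 form works in the browser for this site.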

