Looks like you're using protocol-httpclient; try again with the protocol-http plugin instead. We crawled a large part of Wikipedia for test purposes, and all modern character sets worked just fine.
Can you fetch http://es.wikipedia.org/wiki/Espa%C3%B1olas with the parse or index checker? It works fine here.

> I am trying to crawl a website which has link(s) with spanish/latin
> characters in the url filename. I can't get Nutch to crawl the page(s)
> with spanish accented chars in URL.
>
> Link: http://mydomain.com/en Español.aspx or
> http://mydomain.com/en%20Español.aspx
>
> I tried to substitute the URL encode(%F1) for the special character (ñ),
> (and %20 is for " "), the whole list here
> <http://www.w3schools.com/TAGS/ref_urlencode.asp> .
>
> The link http://mydomain.com/en%20Espa%F1ol.aspx works fine in the
> browser
>
> I tried to use regex URL normalizer to do the substitution in
> regex-normalize.xml file as below (%20 is for " ") and (%F1 for the
> special character ñ).
>
> <!-- replaces blank space(" ") in URL with escaped "%20" -->
> <regex>
>   <pattern> </pattern>
>   <substitution>%20</substitution>
> </regex>
>
> <!-- replaces accented char("ñ") in URL with escaped "%F1" -->
> <regex>
>   <pattern>ñ</pattern>
>   <substitution>%F1</substitution>
> </regex>
>
> The former(blank space) substitution works fine, but having trouble with
> the latter (ñ) substitution, I am getting a FATAL error(pointing to ñ
> location in the file) in the command prompt and the below error in my
> hadoop log.
>
> ERROR regex.RegexURLNormalizer - error parsing conf file:
> org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2
> of 4-byte UTF-8 sequence.
>
> Then I tried changing the character encoding in nutch-site.xml file
>
> <property>
>   <name>parser.character.encoding.default</name>
>   <value>ISO-8859-1</value>
>   <description>The character encoding to fall back to when no other
>   information is available</description>
> </property>
>
> And in the regex-normalize.xml file as below
>
> <regex>
>   <pattern>U+00F1</pattern>
>   <substitution>%F1</substitution>
> </regex>
>
> Now, I don't have any error in the command prompt, but I get the below
> error in my hadoop log. It looks like the substitution is happening but
> instead of the "%F1" it uses "?".
>
> ERROR httpclient.Http - java.lang.IllegalArgumentException: Invalid uri
> 'http://mydomain.com/en%20Espa?ol.aspx': escaped absolute path not valid
> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
> org.apache.commons.httpclient.HttpMethodBase.<init>(HttpMethodBase.java:222)
> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
> org.apache.commons.httpclient.methods.GetMethod.<init>(GetMethod.java:89)
> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
> org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:70)
> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
> org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:154)
> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
> org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:224)
> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at
> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:628)
> 2011-10-03 16:44:02,126 INFO fetcher.Fetcher - fetch of
> http://mydomain.com/en%20Espa?ol.aspx failed with:
> java.lang.IllegalArgumentException: Invalid uri
> 'http://mydomain.com/en%20Espa?ol.aspx': escaped absolute path not valid.
>
> Can anyone help me with this issue? Is there any other config changes I
> need to do to get this to work?
>
> Thanks in advance, any help in resolving this issue is much appreciated.
>
> thanks & regards,
> Rajesh Ramana
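
A note on the two escapes that appear in this thread: the Wikipedia URL above carries ñ as %C3%B1 (its UTF-8 bytes), while the quoted message uses %F1 (its ISO-8859-1 byte). The sketch below is plain Python using only the standard library, not Nutch code, and the variable names are illustrative. It shows both escaped forms for the example path, and that a lone 0xF1 byte is not valid UTF-8. That would be consistent with the MalformedByteSequenceException quoted above if regex-normalize.xml was saved as ISO-8859-1 without an encoding declaration, which is an assumption, since the thread does not say how the file was saved.

```python
from urllib.parse import quote, unquote

# Example path from the quoted message (space and accented character unescaped).
path = "en Español.aspx"

# Percent-encode using UTF-8 bytes: the form used by the es.wikipedia.org URL.
utf8_path = quote(path, encoding="utf-8")

# Percent-encode using ISO-8859-1 bytes: the %F1 form from the quoted message.
latin1_path = quote(path, encoding="latin-1")

print(utf8_path)    # en%20Espa%C3%B1ol.aspx
print(latin1_path)  # en%20Espa%F1ol.aspx

# Assumption: the config file holds ñ as the single ISO-8859-1 byte 0xF1.
# In UTF-8, 0xF1 is the lead byte of a 4-byte sequence, so the byte that
# follows it ("<" here) is rejected as an invalid continuation byte.
raw_config_fragment = b"<pattern>\xf1</pattern>"
try:
    raw_config_fragment.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

print(decodes_as_utf8)  # False

# Decoding the Wikipedia path segment back (unquote defaults to UTF-8).
print(unquote("Espa%C3%B1olas"))  # Españolas
```

Which of the two escaped forms a given server accepts depends on that server. The browser test in the quoted message suggests that this one accepts %F1.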

