Nutch not crawling URLs with Spanish accented characters (ñ)

Ramanathapuram, Rajesh Mon, 03 Oct 2011 14:34:25 -0700

Hi,


I am trying to crawl a website that has links with Spanish/Latin characters
in the URL filename. I can't get Nutch to crawl the pages whose URLs contain
Spanish accented characters.

 

  Link: http://mydomain.com/en Español.aspx   or
        http://mydomain.com/en%20Español.aspx

 

I tried substituting the URL-encoded form (%F1) for the special character (ñ),
with %20 for the space (" "); the full list of encodings is here:
<http://www.w3schools.com/TAGS/ref_urlencode.asp>.


  The link http://mydomain.com/en%20Espa%F1ol.aspx works fine in the browser
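
For reference, %F1 is the percent-encoding of ñ as a single ISO-8859-1 byte; encoded as UTF-8, the same character becomes %C3%B1, and which form a given server accepts depends on how it is configured. A minimal Java sketch of both forms follows. The class and the encodePathSegment helper are hypothetical names used only for illustration; they are not part of Nutch.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class PercentEncodeSketch {
    // Hypothetical helper: percent-encodes one path segment in the given charset.
    // URLEncoder targets form data and writes a space as "+", so "+" is turned
    // into "%20" afterwards (a literal "+" in the input is already "%2B" by then).
    static String encodePathSegment(String segment, String charsetName) {
        try {
            return URLEncoder.encode(segment, charsetName).replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String file = "en Espa\u00F1ol.aspx"; // \u00F1 is ñ
        System.out.println(encodePathSegment(file, "ISO-8859-1")); // prints en%20Espa%F1ol.aspx
        System.out.println(encodePathSegment(file, "UTF-8"));      // prints en%20Espa%C3%B1ol.aspx
    }
}
```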

 

I tried to use the regex URL normalizer to do the substitution in the
regex-normalize.xml file, as below (%20 for " " and %F1 for the special
character ñ).

<!-- replaces blank space (" ") in URL with escaped "%20" -->
<regex>
  <pattern> </pattern>
  <substitution>%20</substitution>
</regex>

<!-- replaces accented char ("ñ") in URL with escaped "%F1" -->
<regex>
  <pattern>ñ</pattern>
  <substitution>%F1</substitution>
</regex>

 

The former (blank space) substitution works fine, but I am having trouble with
the latter (ñ) substitution: I get a FATAL error (pointing to the location of ñ
in the file) at the command prompt and the error below in my Hadoop log.

     ERROR regex.RegexURLNormalizer - error parsing conf file:
     org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
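
One plausible reading of this error, offered as an assumption rather than a confirmed diagnosis: the file was saved in ISO-8859-1 (or Windows-1252), where ñ is the single byte 0xF1, while the XML parser read it as UTF-8, where 0xF1 announces a 4-byte sequence. The Java sketch below (class and method names are illustrative only) shows that the same pattern line is valid UTF-8 only when it is saved as UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class EncodingMismatchSketch {
    // Returns true if the bytes decode cleanly as UTF-8, false otherwise.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String line = "<pattern>\u00F1</pattern>"; // \u00F1 is ñ
        // Saved as ISO-8859-1: ñ is the lone byte 0xF1, followed by '<'.
        byte[] latin1 = line.getBytes(StandardCharsets.ISO_8859_1);
        // Saved as UTF-8: ñ is the two bytes 0xC3 0xB1.
        byte[] utf8 = line.getBytes(StandardCharsets.UTF_8);
        System.out.println(isValidUtf8(latin1)); // prints false
        System.out.println(isValidUtf8(utf8));   // prints true
    }
}
```

If that is indeed the cause, saving regex-normalize.xml as UTF-8, or writing the character as the XML character reference &#xF1;, should avoid the parse error.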

  

Then I tried changing the character encoding in the nutch-site.xml file:

<property>
  <name>parser.character.encoding.default</name>
  <value>ISO-8859-1</value>
  <description>The character encoding to fall back to when no other information
  is available</description>
</property>

I also changed the regex-normalize.xml file as below:

<regex>
  <pattern>U+00F1</pattern>
  <substitution>%F1</substitution>
</regex>
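
Note that if the normalizer compiles its patterns with java.util.regex (an assumption here), U+00F1 is not a character escape: it matches one or more letters "U" followed by the literal text "00F1". The Java regex escape for ñ is \u00F1. A small sketch, using a hypothetical substitute helper that is not Nutch code:

```java
import java.util.regex.Pattern;

final class RegexEscapeSketch {
    // Hypothetical helper: applies one pattern/substitution pair to a URL.
    static String substitute(String regex, String replacement, String url) {
        return Pattern.compile(regex).matcher(url).replaceAll(replacement);
    }

    public static void main(String[] args) {
        String url = "http://mydomain.com/en%20Espa\u00F1ol.aspx"; // contains ñ

        // "U+00F1" looks for "U", "UU", ... followed by "00F1": no match, URL unchanged.
        System.out.println(substitute("U+00F1", "%F1", url).equals(url)); // prints true

        // The six-character regex \u00F1 is the regex escape for ñ.
        System.out.println(substitute("\\u00F1", "%F1", url));
        // prints http://mydomain.com/en%20Espa%F1ol.aspx
    }
}
```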

 

Now I no longer get an error at the command prompt, but I get the error below
in my Hadoop log. It looks like the substitution is happening, but "?" is used
instead of "%F1".

 

ERROR httpclient.Http - java.lang.IllegalArgumentException: Invalid uri 'http://mydomain.com/en%20Espa?ol.aspx': escaped absolute path not valid
2011-10-03 16:44:02,123 ERROR httpclient.Http - at org.apache.commons.httpclient.HttpMethodBase.<init>(HttpMethodBase.java:222)
2011-10-03 16:44:02,123 ERROR httpclient.Http - at org.apache.commons.httpclient.methods.GetMethod.<init>(GetMethod.java:89)
2011-10-03 16:44:02,123 ERROR httpclient.Http - at org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:70)
2011-10-03 16:44:02,123 ERROR httpclient.Http - at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:154)
2011-10-03 16:44:02,123 ERROR httpclient.Http - at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:224)
2011-10-03 16:44:02,123 ERROR httpclient.Http - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:628)
2011-10-03 16:44:02,126 INFO  fetcher.Fetcher - fetch of http://mydomain.com/en%20Espa?ol.aspx failed with: java.lang.IllegalArgumentException: Invalid uri 'http://mydomain.com/en%20Espa?ol.aspx': escaped absolute path not valid.
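
The "?" may simply be what Java emits when a string containing ñ is converted to a charset that cannot represent it, such as US-ASCII. Whether that conversion happens inside Nutch, inside the HTTP client, or only when the log line is written is not established here. A minimal illustration, with hypothetical names:

```java
import java.nio.charset.StandardCharsets;

final class LossyCharsetSketch {
    // Encodes to US-ASCII and decodes again; String.getBytes replaces
    // characters the charset cannot represent with the byte for '?'.
    static String roundTripAscii(String s) {
        byte[] ascii = s.getBytes(StandardCharsets.US_ASCII);
        return new String(ascii, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String url = "http://mydomain.com/en%20Espa\u00F1ol.aspx"; // contains ñ
        System.out.println(roundTripAscii(url));
        // prints http://mydomain.com/en%20Espa?ol.aspx
    }
}
```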

 

 

Can anyone help me with this issue? Are there any other config changes I need
to make to get this to work?

 

Thanks in advance, any help in resolving this issue is much appreciated. 

 

thanks & regards,
Rajesh Ramana
