Re: Nutch not crawling URLs with spanish accented characters ( ñ)

Markus Jelsma Mon, 03 Oct 2011 14:47:05 -0700

Looks like you're using protocol-httpclient; try again with the protocol-http
plugin instead. We crawled a large part of Wikipedia for test purposes, and all
modern character sets worked just fine.


Can you fetch:
http://es.wikipedia.org/wiki/Espa%C3%B1olas

with parse or index checker? It works fine here.
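
A note on the escapes involved: the Wikipedia URL above encodes ñ as %C3%B1,
which is the UTF-8 form, while %F1 is the ISO-8859-1 form of the same
character. Which of the two a given server accepts is something to check
against the site itself. A minimal sketch showing both escapes, using Python
purely for illustration (it is not part of Nutch, and the file name is the one
from the quoted message):

```python
from urllib.parse import quote

# File name taken from the quoted message below.
name = "en Español.aspx"

# Percent-encode via UTF-8: ñ becomes the two bytes C3 B1.
utf8_escaped = quote(name, encoding="utf-8")

# Percent-encode via ISO-8859-1: ñ becomes the single byte F1.
latin1_escaped = quote(name, encoding="latin-1")

print(utf8_escaped)
print(latin1_escaped)
```

Both results are plain ASCII, so either can be placed in a seed list or a
normalizer substitution without the configuration file's own encoding
mattering.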


> 
> 
> 
> I am trying to crawl a website which has link(s) with Spanish/Latin
> characters in the URL filename. I can't get Nutch to crawl the page(s)
> with Spanish accented chars in the URL.
> 
> 
> 
>   Link: http://mydomain.com/en Español.aspx or
> http://mydomain.com/en%20Español.aspx
> 
> 
> 
> I tried substituting the URL-encoded form (%F1) for the special character
> (ñ), and %20 for the space (" "); the whole list is here:
> <http://www.w3schools.com/TAGS/ref_urlencode.asp>.
> 
> 
>   The link http://mydomain.com/en%20Espa%F1ol.aspx works fine in the
> browser.
> 
> 
> 
> I tried using the regex URL normalizer to do the substitution in the
> regex-normalize.xml file, as below (%20 for " " and %F1 for the
> special character ñ).
> 
> <!-- replaces blank space (" ") in URL with escaped "%20" -->
> <regex>
>   <pattern> </pattern>
>   <substitution>%20</substitution>
> </regex>
> 
> <!-- replaces accented char ("ñ") in URL with escaped "%F1" -->
> <regex>
>   <pattern>ñ</pattern>
>   <substitution>%F1</substitution>
> </regex>
> 
> 
> 
> The former (blank space) substitution works fine, but I am having trouble
> with the latter (ñ) substitution: I get a FATAL error (pointing to the ñ
> location in the file) in the command prompt and the error below in my
> Hadoop log.
> 
>      ERROR regex.RegexURLNormalizer - error parsing conf file:
> org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2
> of 4-byte UTF-8 sequence.
> 
> 
> 
> Then I tried changing the character encoding in the nutch-site.xml file:
> 
> <property>
>   <name>parser.character.encoding.default</name>
>   <value>ISO-8859-1</value>
>   <description>The character encoding to fall back to when no other
>   information is available</description>
> </property>
> 
> And in the regex-normalize.xml file, as below:
> 
> <regex>
>   <pattern>U+00F1</pattern>
>   <substitution>%F1</substitution>
> </regex>
> 
> 
> 
> Now I don't get any error in the command prompt, but I do get the error
> below in my Hadoop log. It looks like the substitution is happening, but
> instead of "%F1" it uses "?".
> 
> 
> 
> ERROR httpclient.Http - java.lang.IllegalArgumentException: Invalid uri
> 'http://mydomain.com/en%20Espa?ol.aspx': escaped absolute path not valid
> 
> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at org.apache.commons.httpclient.HttpMethodBase.<init>(HttpMethodBase.java:222)
> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at org.apache.commons.httpclient.methods.GetMethod.<init>(GetMethod.java:89)
> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:70)
> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:154)
> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:224)
> 2011-10-03 16:44:02,123 ERROR httpclient.Http - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:628)
> 
> 2011-10-03 16:44:02,126 INFO  fetcher.Fetcher - fetch of
> http://mydomain.com/en%20Espa?ol.aspx failed with:
> java.lang.IllegalArgumentException: Invalid uri
> 'http://mydomain.com/en%20Espa?ol.aspx': escaped absolute path not valid.
> 
> 
> 
> 
> 
> Can anyone help me with this issue? Are there any other config changes I
> need to make to get this to work?
> 
> 
> 
> Thanks in advance, any help in resolving this issue is much appreciated.
> 
> 
> 
> thanks & regards,
> Rajesh Ramana
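
On the XML parser error quoted above ("Invalid byte 2 of 4-byte UTF-8
sequence"): one plausible cause, stated here as an assumption since the file
itself is not shown, is that regex-normalize.xml was saved as ISO-8859-1 but
parsed as UTF-8. In ISO-8859-1, ñ is the single byte 0xF1, which a UTF-8
decoder reads as the lead byte of a 4-byte sequence; the "<" that follows is
not a valid continuation byte. A small sketch of that mismatch, again in
Python for illustration only and not Nutch code:

```python
# The <pattern> line as it would be stored if the file were saved as ISO-8859-1.
line_latin1 = "<pattern>ñ</pattern>".encode("latin-1")

# "<pattern>" is 9 characters long, so index 9 holds the byte for ñ.
# 0xF1 has the bit pattern 11110xxx, which UTF-8 reserves for 4-byte lead bytes.
lead_byte = line_latin1[9]

try:
    line_latin1.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError as err:
    decoded_ok = False
    print("UTF-8 decode failed:", err.reason)

# Saved as UTF-8 instead, the same line decodes cleanly.
line_utf8 = "<pattern>ñ</pattern>".encode("utf-8")
roundtrip = line_utf8.decode("utf-8")
```

If that is what happened, re-saving the file as UTF-8, or declaring
encoding="ISO-8859-1" in its XML declaration, should let the parser read the
literal ñ.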
