Hi Folks,

I solved the issue. I am sharing the fix here in case others run into the
same problem.

It is caused by a bug in FileResponse.java in the protocol-file plugin:
file names containing UTF-8 characters are not encoded properly. I changed
some code in the constructor and in one private method called list2html.
The change combines the discussions in the following two JIRAs.

https://issues.apache.org/jira/browse/NUTCH-824
https://issues.apache.org/jira/browse/NUTCH-968

It is important to change the code in both the constructor and the private
method.
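
For anyone who wants to see the idea in isolation, here is a minimal sketch.
It is not the patch from the JIRAs and it does not touch any Nutch class; the
class and method names (FileNameEncodingSketch, encodePath, decodePath) are
made up for illustration. It only shows how java.net.URI percent-encodes a
non-ASCII file name as UTF-8 and decodes it back again.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical demo class; none of these names come from Nutch.
class FileNameEncodingSketch {

    // Turn a raw file system path into an ASCII-only file: URL string.
    // The multi-argument URI constructor quotes illegal characters such as
    // spaces, and toASCIIString() percent-encodes non-ASCII characters as UTF-8.
    static String encodePath(String rawPath) {
        try {
            return new URI("file", null, rawPath, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Recover the original path from an encoded file: URL string.
    // URI.getPath() decodes percent-escapes, interpreting them as UTF-8.
    static String decodePath(String encodedUrl) {
        try {
            return new URI(encodedUrl).getPath();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // "\u4e2d\u6587" is a two-character CJK name, written with escapes so
        // the source file itself stays ASCII.
        String path = "/tmp/\u4e2d\u6587 file.txt";
        String encoded = encodePath(path);
        System.out.println(encoded);
        System.out.println(decodePath(encoded).equals(path));
    }
}
```

The encoding direction corresponds to building links for a directory listing,
and the decoding direction to turning the fetched URL back into a path on
disk. I am assuming that is why both places in FileResponse.java need the
change.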

Cheers,

Ye


On Wed, Aug 29, 2012 at 10:52 PM, hugo.ma <[email protected]> wrote:

> I had a similar problem. My solution was to modify the HttpResponse class
> in org.apache.nutch.protocol.httpclient.
>
> In the constructor I changed the first lines like this:
>
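>    // Prepare GET method for HTTP request
>    this.url = url;
>    URI uri = null; // MODIFIED (this is java.net.URI)
>
>    try {
>      // The multi-argument constructor quotes illegal characters.
>      // Note: url.getHost() omits the port; url.getAuthority() keeps it.
>      uri = new URI(url.getProtocol(), url.getHost(), url.getPath(),
>          url.getQuery(), null);
>    } catch (Exception e) {
>      // do whatever you want; uri stays null if the error is ignored
>    }
>
>    // toASCIIString() percent-encodes non-ASCII characters as UTF-8
>    GetMethod get = new GetMethod(uri.toASCIIString());
>
>    // Continue with the original code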
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/local-file-system-crawl-unable-to-fetch-file-name-containing-CJK-letter-tp4003999p4004059.html
