Hi folks, I solved the issue. I am sharing the fix here in case others run into the same problem.
The cause is a bug in FileResponse.java in the protocol-file plugin: file names containing UTF-8 characters (CJK, in my case) are not encoded properly. I changed some code in the constructor and in one private method, list2html. The change combines the discussions in the following two JIRAs:

https://issues.apache.org/jira/browse/NUTCH-824
https://issues.apache.org/jira/browse/NUTCH-968

It is important to change the code in both the constructor and the private method.

Cheers,
Ye

On Wed, Aug 29, 2012 at 10:52 PM, hugo.ma <[email protected]> wrote:
> I had a similar problem. My solution was to modify the HttpResponse class
> in org.apache.nutch.protocol.httpclient.
>
> In the constructor I changed the first lines like this:
>
> // Prepare GET method for HTTP request
> this.url = url;
> URI uri = null;
> // MODIFIED
> try {
>   uri = new URI(url.getProtocol(), url.getHost(), url.getPath(),
>       url.getQuery(), null);
> } catch (Exception e) {
>   // do whatever you want
> }
>
> GetMethod get = new GetMethod(uri.toASCIIString());
>
> // Continue with the original code
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/local-file-system-crawl-unable-to-fetch-file-name-containing-CJK-letter-tp4003999p4004059.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
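
P.S. For anyone who wants to see the encoding idea in isolation, here is a minimal standalone sketch. It is not the actual FileResponse patch from the JIRAs; it only shows how the multi-argument java.net.URI constructor plus toASCIIString() percent-encodes a non-ASCII file name as UTF-8. The class name, the helper toAsciiUri, and the sample path are made up for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical demo class; nothing here is part of Nutch.
final class FileNameEncodingSketch {

    // Builds a URI from its components and returns an ASCII-only form.
    // The multi-argument constructor quotes illegal characters such as
    // spaces; toASCIIString() then percent-encodes the remaining
    // non-ASCII characters as UTF-8 bytes.
    static String toAsciiUri(String scheme, String authority, String path)
            throws URISyntaxException {
        URI uri = new URI(scheme, authority, path, null, null);
        return uri.toASCIIString();
    }

    public static void main(String[] args) throws URISyntaxException {
        // "\u4E2D\u6587 \u6587\u4EF6.txt" is a CJK file name containing a
        // space, written with Unicode escapes so the source stays ASCII.
        String path = "/data/docs/\u4E2D\u6587 \u6587\u4EF6.txt";
        System.out.println(toAsciiUri("file", null, path));
        // prints file:/data/docs/%E4%B8%AD%E6%96%87%20%E6%96%87%E4%BB%B6.txt
    }
}
```

Note that the five-argument constructor takes an authority (host plus optional port) as its second argument, so passing only a host name drops any non-default port.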

