Hi folks,

I am indexing a local file system using the file protocol plugin (protocol-file). I have run into an issue where the crawler is unable to fetch files whose names contain CJK (non-English) characters, in my case Korean characters.
I have the following files in my target local directory:

    file1.txt
    file2.txt
    filewithkorean가맹점정.txt
    fileN.txt

When I crawl, the crawler fetches only file1.txt, file2.txt and fileN.txt, but not filewithkorean가맹점정.txt.

I ran the parser checker to see which outlinks are extracted from the directory:

    ./bin/nutch org.apache.nutch.parse.ParserChecker file:///C:/targetdir

This is the result:

    Title: Index of C:\targetdir
    Outlinks: 2
      outlink: toUrl: file:/C:/targetdir/file1.txt anchor: file1.txt
      outlink: toUrl: file:/C:/targetdir/file2.txt anchor: file2.txt
      outlink: toUrl: file:/C:/targetdir/filewithkorean??????.txt anchor: filewithkorean??????.txt
      outlink: toUrl: file:/C:/targetdir/fileN.txt anchor: fileN.txt
    Content Metadata: Content-Length=1164 Last-Modified=Wed, 29 Aug 2012 08:47:32 GMT Content-Type=text/html
    Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8

As shown above, the Korean characters become ?????? in the outlink. So when the fetcher runs, it requests /C:/targetdir/filewithkorean??????.txt instead of /C:/targetdir/filewithkorean가맹점정.txt and gets a 404.

My initial guess was that charset detection in the parser was the issue. I tried setting different encodings, such as windows-1252, utf-8, euc-kr and a few others, but none of them fixed the issue.

Has anyone encountered a similar issue and fixed it before? I would appreciate any suggestions.

Thanks,
Ye
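P.S. To illustrate what I suspect might be going on, here is a small standalone Java sketch. It only shows the general mechanism by which Korean characters turn into '?' when a string is converted to bytes with a charset that cannot represent them. I have not confirmed that this is what the file protocol plugin actually does; the class name CjkEncodingDemo and the helper roundTrip are made up for this example, and windows-1252 is just my assumption for a typical Windows default charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Nutch.
class CjkEncodingDemo {

    // Encode the string to bytes with the given charset, then decode it back.
    // String.getBytes(Charset) silently replaces characters the charset
    // cannot represent with that charset's replacement byte ('?' here).
    static String roundTrip(String s, Charset cs) {
        return new String(s.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        // The file name from my test directory, written with Unicode escapes
        // so the source compiles the same way regardless of file encoding.
        String name = "filewithkorean\uAC00\uB9F9\uC810\uC815.txt";

        // Assumption: a single-byte Western charset stands in for the
        // platform default on a Windows machine.
        Charset western = Charset.forName("windows-1252");

        System.out.println("UTF-8 round trip keeps the name intact: "
                + roundTrip(name, StandardCharsets.UTF_8).equals(name));
        System.out.println("windows-1252 round trip: "
                + roundTrip(name, western));
    }
}
```

If something like this happens while the directory listing is built, then changing the parser's encoding afterwards would not help, because the original characters are already lost by that point. Again, this is only a guess on my side.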

