Please have a look at the discussion below:
http://www.mail-archive.com/[email protected]/msg04176.html
It should help you out, or at least point you in the right direction.

hth

Lewis

On Wed, Aug 29, 2012 at 1:13 PM, ytthet <[email protected]> wrote:
> Hi Folks,
>
> I am indexing a local file system using the file-protocol plugin. I have run
> into an issue where the crawler is unable to fetch files whose names contain
> CJK (non-English) characters. In my case, Korean characters.
>
> I have the following files in my target local file system directory:
>
> file1.txt
> file2.txt
> filewithkorean가맹점정.txt
> fileN.txt
>
> When I crawl, the crawler fetches only file1.txt, file2.txt and fileN.txt,
> but not filewithkorean가맹점정.txt.
>
> I ran the parser checker command
>
> ./bin/nutch org.apache.nutch.parse.ParserChecker file:///C:/targetdir
>
> to check the outlinks extracted from the directory. The result is as follows:
>
> Title: Index of C:\targetdir
> Outlinks: 2
> outlink: toUrl: file:/C:/targetdir/file1.txt anchor: file1.txt
> outlink: toUrl: file:/C:/targetdir/file2.txt anchor: file2.txt
> outlink: toUrl: file:/C:/targetdir/filewithkorean??????.txt anchor:
> filewithkorean??????.txt
> outlink: toUrl: file:/C:/targetdir/fileN.txt anchor: fileN.txt
> Content Metadata: Content-Length=1164 Last-Modified=Wed, 29 Aug 2012
> 08:47:32 GMT Content-Type=text/html
> Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
>
> As shown above, the Korean characters become ?????? in the outlink. So when
> the fetcher runs, it fetches /C:/targetdir/filewithkorean??????.txt instead
> of /C:/targetdir/filewithkorean가맹점정.txt and hits a 404.
>
> My initial guess was that charset encoding detection in the parser was the
> issue. I tried setting different encodings, such as windows-1252, utf-9,
> euc-kr and a few others, but that does not seem to fix the issue.
>
> Has anyone encountered a similar issue and fixed it before? I would
> appreciate any suggestions.
>
> Thanks,
>
> Ye
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/local-file-system-crawl-unable-to-fetch-file-name-containing-CJK-letter-tp4003999.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

--
Lewis
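
For illustration only: one common way a name ends up with literal "?" characters is that the string gets encoded to bytes in a charset that cannot represent Hangul somewhere along the way, and each unmappable character is replaced by "?". The sketch below reproduces that effect in plain Python; it is not Nutch code and does not claim to show where the loss happens in the file-protocol plugin. The helper name lossy_roundtrip is hypothetical. In Java, String.getBytes(charset) performs this replacement by default; Python needs errors="replace" to do the same.

```python
from urllib.parse import quote, unquote

# File name taken from the report above.
name = "filewithkorean가맹점정.txt"


def lossy_roundtrip(text: str, charset: str) -> str:
    """Encode text to charset, replacing unmappable characters with '?',
    then decode the bytes again with the same charset."""
    return text.encode(charset, errors="replace").decode(charset)


# windows-1252 has no Hangul, so every Korean character turns into '?'.
via_cp1252 = lossy_roundtrip(name, "cp1252")

# UTF-8 can represent every character, so the name survives unchanged.
via_utf8 = lossy_roundtrip(name, "utf-8")

# Percent-encoding the UTF-8 bytes gives an ASCII-only form of the name
# that can be decoded back without loss.
escaped = quote(name, encoding="utf-8")
restored = unquote(escaped, encoding="utf-8")

if __name__ == "__main__":
    print(via_cp1252)
    print(via_utf8)
    print(escaped)
    print(restored)
```

Once a character has been replaced by "?", the original name cannot be recovered, which is consistent with changing the parser's encoding setting afterwards having no effect: under this assumption the loss would already have occurred when the directory listing was generated.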

