Thanks Lewis,

My guess is that the issue is either with the encoding in the parser or with the file protocol plugin.
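To illustrate the kind of failure I suspect, here is a minimal Java sketch. It is not Nutch code: the class and method names are made up for illustration, and ISO-8859-1 is only a stand-in for whichever charset actually gets applied to the directory listing. It shows that pushing a name through a charset that cannot represent Hangul replaces each such character with '?', while UTF-8 keeps the name intact.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Nutch.
public class LossyFileNameDemo {

    // Encode the name to bytes with the given charset, then decode it again.
    // String.getBytes(Charset) substitutes the charset's replacement byte
    // (here '?') for every character the charset cannot represent.
    static String roundTrip(String name, Charset charset) {
        return new String(name.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // "filewithkorean" plus four Hangul syllables, written as Unicode
        // escapes so the encoding of this source file does not matter.
        String name = "filewithkorean\uAC00\uB9F9\uC810\uC815.txt";

        String lossy = roundTrip(name, StandardCharsets.ISO_8859_1);
        String utf8 = roundTrip(name, StandardCharsets.UTF_8);

        System.out.println("ISO-8859-1 round trip: " + lossy);
        System.out.println("UTF-8 round trip intact: " + utf8.equals(name));
    }
}
```

If something like this happens before the outlinks are built, the real file name is already lost by the time the parser sees the listing, which would explain why changing the parser's encoding setting makes no difference. That is only a hypothesis at this point.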
I also found the issue below and tried it, but it does not work.

https://issues.apache.org/jira/browse/NUTCH-824

I am still digging around the source code to get it solved.

Regards,

Ye

On Wed, Aug 29, 2012 at 9:12 PM, Lewis John Mcgibbney <[email protected]> wrote:

> Please have a look at the discussion below
>
> http://www.mail-archive.com/[email protected]/msg04176.html
>
> It should help you out.. or point you in the correct direction at least.
>
> hth
>
> Lewis
>
> On Wed, Aug 29, 2012 at 1:13 PM, ytthet <[email protected]> wrote:
> > Hi Folks,
> >
> > I am indexing local file system using file-protocol plugin. I encounter an
> > issue where the crawler is unable to fetch file name that contains CJK (non
> > English characters). For my case Korean characters.
> >
> > I have following file in my target local file system directory.
> >
> > file1.txt
> > file2.txt
> > filewithkorean가맹점정.txt
> > fileN.txt
> >
> > When I crawl, the crawler could only fetch file1.txt, file2.txt and
> > filen.txt. But not the filewithkorean가맹점정.txt.
> >
> > I tried parser checker command ./bin/nutch
> > org.apache.nutch.parse.ParserChecker file:///C:/targetdir to check the
> > outlink extracted the directory. following is the result.
> >
> > Title: Index of C:\targetdir
> > Outlinks: 2
> > outlink: toUrl: file:/C:/targetdir/file1.txt anchor: file1.txt
> > outlink: toUrl: file:/C:/targetdir/file2.txt anchor: file2.txt
> > outlink: toUrl: file:/C:/targetdir/filewithkorean??????.txt anchor:
> > filewithkorean??????.txt
> > outlink: toUrl: file:/C:/targetdir/fileN.txt anchor: fileN.txt
> > Content Metadata: Content-Length=1164 Last-Modified=Wed, 29 Aug 2012
> > 08:47:32 GMT Content-Type=text/html
> > Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
> >
> > As above, the korean characters become ????? in the outlink. Thus when the
> > fetcher runs, it fetches /C:/targetdir/filewithkorean??????.txt instead of
> > /C:/targetdir/filewithkorean가맹점정.txt and hit 404.
> >
> > My initial guess was that CharSet encoding detection in the parser was the
> > issue. I tried setting different encodings such as, windows-1252, utf-9,
> > euc-kr and few others. But that does not seem to fix the issue.
> >
> > Has anyone encountered similar issue and fixed it before? I would appreciate
> > any suggestion.
> >
> > Thanks,
> >
> > Ye
> >
> > --
> > View this message in context:
> > http://lucene.472066.n3.nabble.com/local-file-system-crawl-unable-to-fetch-file-name-containing-CJK-letter-tp4003999.html
> > Sent from the Nutch - User mailing list archive at Nabble.com.
>
> --
> Lewis

