Hi Folks,

I am indexing a local file system using the file-protocol plugin. I have run
into an issue where the crawler is unable to fetch files whose names contain
CJK (non-English) characters; in my case, Korean characters.

I have the following files in my target local file system directory:

file1.txt
file2.txt
filewithkorean가맹점정.txt
fileN.txt

When I crawl, the crawler fetches only file1.txt, file2.txt and fileN.txt,
but not filewithkorean가맹점정.txt.

I ran the parser checker command ./bin/nutch
org.apache.nutch.parse.ParserChecker file:///C:/targetdir to check the
outlinks extracted from the directory. The following is the result:

Title: Index of C:\targetdir
Outlinks: 2
  outlink: toUrl: file:/C:/targetdir/file1.txt anchor: file1.txt
  outlink: toUrl: file:/C:/targetdir/file2.txt anchor: file2.txt
  outlink: toUrl: file:/C:/targetdir/filewithkorean??????.txt anchor: filewithkorean??????.txt
  outlink: toUrl: file:/C:/targetdir/fileN.txt anchor: fileN.txt
Content Metadata: Content-Length=1164 Last-Modified=Wed, 29 Aug 2012 08:47:32 GMT Content-Type=text/html
Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8


As shown above, the Korean characters become question marks (??????) in the
outlink. As a result, when the fetcher runs, it fetches
/C:/targetdir/filewithkorean??????.txt instead of
/C:/targetdir/filewithkorean가맹점정.txt and hits a 404.
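
For reference, this kind of substitution is what Java does when a string is
encoded with a charset that has no mapping for some of its characters. The
sketch below only illustrates that mechanism; it is not taken from the Nutch
source, the class and method names are made up, and ISO-8859-1 is just a
stand-in for a single-byte Western charset. Whether this is what actually
happens in the crawl is an assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class QuestionMarkDemo {

    // Encode the text with the given charset and decode it again with the
    // same charset. String.getBytes(Charset) substitutes the charset's
    // replacement bytes for characters it cannot represent.
    public static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // The four Hangul syllables from the file name, as Unicode escapes
        String name = "filewithkorean\uAC00\uB9F9\uC810\uC815.txt";

        // UTF-8 can represent every character, so the name survives
        System.out.println(roundTrip(name, StandardCharsets.UTF_8).equals(name));

        // ISO-8859-1 has no Hangul, so each syllable is replaced
        System.out.println(roundTrip(name, StandardCharsets.ISO_8859_1));
    }
}
```

With ISO-8859-1 every Hangul syllable comes back as a single '?', and the
original name cannot be recovered from the result, whatever encoding is used
to read it afterwards.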

My initial guess was that charset encoding detection in the parser was the
issue. I tried setting different encodings, such as windows-1252, utf-8,
euc-kr and a few others, but that does not seem to fix the issue.
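
One thing that may be worth ruling out is the JVM's default charset on the
Windows machine, since a parser-side encoding setting cannot restore
characters that were already lost before parsing. The check below is a
minimal sketch with a made-up class name; it assumes, without confirmation
from the plugin source, that the default charset is involved when the
directory listing is generated.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultCharsetCheck {

    // True if every character of the text can be represented in the charset.
    public static boolean canEncode(Charset charset, String text) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // The four Hangul syllables from the file name, as Unicode escapes
        String name = "filewithkorean\uAC00\uB9F9\uC810\uC815.txt";

        // The charset the JVM uses when no encoding is given explicitly
        Charset platform = Charset.defaultCharset();
        System.out.println("default charset: " + platform.name());
        System.out.println("can encode file name: " + canEncode(platform, name));
        System.out.println("UTF-8 can encode it: " + canEncode(StandardCharsets.UTF_8, name));
    }
}
```

If the second line prints false, any code path that relies on the default
charset would be unable to carry the Korean file name through unchanged.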

Has anyone encountered a similar issue and fixed it before? I would
appreciate any suggestions.

Thanks,

Ye

--
View this message in context: 
http://lucene.472066.n3.nabble.com/local-file-system-crawl-unable-to-fetch-file-name-containing-CJK-letter-tp4003999.html
Sent from the Nutch - User mailing list archive at Nabble.com.
