Please have a look at the discussion below

http://www.mail-archive.com/[email protected]/msg04176.html

It should help you out, or at least point you in the right direction.

hth

Lewis

On Wed, Aug 29, 2012 at 1:13 PM, ytthet <[email protected]> wrote:
> Hi Folks,
>
> I am indexing a local file system using the file-protocol plugin. I have run
> into an issue where the crawler is unable to fetch files whose names contain
> CJK (non-English) characters, in my case Korean characters.
>
> I have the following files in my target local file system directory.
>
> file1.txt
> file2.txt
> filewithkorean가맹점정.txt
> fileN.txt
>
> When I crawl, the crawler fetches only file1.txt, file2.txt and
> fileN.txt, but not filewithkorean가맹점정.txt.
>
> I ran the parser checker command ./bin/nutch
> org.apache.nutch.parse.ParserChecker file:///C:/targetdir to check the
> outlinks extracted from the directory. The following is the result.
>
> Title: Index of C:\targetdir
> Outlinks: 2
>   outlink: toUrl: file:/C:/targetdir/file1.txt anchor: file1.txt
>   outlink: toUrl: file:/C:/targetdir/file2.txt anchor: file2.txt
>   outlink: toUrl: file:/C:/targetdir/filewithkorean??????.txt anchor:
> filewithkorean??????.txt
>   outlink: toUrl: file:/C:/targetdir/fileN.txt anchor: fileN.txt
> Content Metadata: Content-Length=1164 Last-Modified=Wed, 29 Aug 2012
> 08:47:32 GMT Content-Type=text/html
> Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
>
>
> As shown above, the Korean characters become ?????? in the outlink. Thus, when
> the fetcher runs, it fetches /C:/targetdir/filewithkorean??????.txt instead of
> /C:/targetdir/filewithkorean가맹점정.txt and hits a 404.
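
Question marks in place of non-ASCII characters are what Java produces when a string is encoded with a charset that cannot represent those characters: String.getBytes substitutes the charset's replacement byte, which is '?' for ISO-8859-1. Whether this is what happens inside the file protocol plugin is an assumption, not something established in this thread. A minimal standalone sketch (the class and method names are made up for illustration, and ISO-8859-1 only stands in for whatever charset might be in play):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class QuestionMarkDemo {
    // Encode the text with the given charset, then decode the bytes
    // again with the same charset.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // The escapes spell the four Korean characters from the file name above.
        String name = "filewithkorean\uAC00\uB9F9\uC810\uC815.txt";

        // ISO-8859-1 has no Korean characters, so each one is replaced by '?'.
        System.out.println(roundTrip(name, StandardCharsets.ISO_8859_1));
        // prints filewithkorean????.txt

        // UTF-8 can represent every character, so the name survives intact.
        System.out.println(roundTrip(name, StandardCharsets.UTF_8).equals(name));
        // prints true
    }
}
```

If something similar happens while the directory listing is being built, the outlink would already be damaged before the parser sees it.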
>
> My initial guess was that charset detection in the parser was the
> issue. I tried setting different encodings, such as windows-1252, utf-8,
> euc-kr and a few others, but that does not seem to fix the issue.
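
If the question marks come from a lossy encoding step that runs before parsing (an assumption, not verified against the Nutch source), then changing the parser's encoding cannot help: the original characters have already been replaced by the literal byte 0x3F, and that byte decodes to '?' under every common charset. A small sketch of the idea, using a hypothetical class name and only the charsets that every JVM is guaranteed to provide:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LossIsPermanent {
    // Encode with US-ASCII, which cannot represent Korean characters;
    // each unmappable character becomes the single byte 0x3F ('?').
    static byte[] lossyBytes(String text) {
        return text.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] damaged = lossyBytes("\uAC00\uB9F9"); // two Korean characters

        // Whatever charset is used to decode, the result is the same,
        // because the original character data is no longer in the bytes.
        for (Charset cs : Arrays.asList(StandardCharsets.UTF_8,
                                        StandardCharsets.ISO_8859_1,
                                        StandardCharsets.US_ASCII)) {
            System.out.println(cs + " -> " + new String(damaged, cs));
        }
    }
}
```

In that case the fix would have to be made where the file names are first turned into the listing, not in the parser settings.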
>
> Has anyone encountered a similar issue and fixed it before? I would appreciate
> any suggestions.
>
> Thanks,
>
> Ye
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/local-file-system-crawl-unable-to-fetch-file-name-containing-CJK-letter-tp4003999.html
> Sent from the Nutch - User mailing list archive at Nabble.com.



-- 
Lewis
