Thanks Lewis,

My guess is that the issue is either with the encoding in the parser or
with the file protocol plugin.
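
To illustrate why I suspect encoding: in Java, when a string is encoded
with a charset that cannot represent some of its characters, each such
character is replaced by '?'. The sketch below only shows that mechanism.
It is not taken from the Nutch source, and the class name
CjkReplacementDemo and the choice of ISO-8859-1 are assumptions made for
the illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Nutch.
class CjkReplacementDemo {

    // Encode the text to bytes and decode it again with the same charset.
    // String.getBytes(Charset) substitutes the charset's replacement byte
    // ('?' for ISO-8859-1) for every character it cannot map.
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Unicode escapes for the four Hangul syllables in the file name,
        // so the source file itself does not depend on any encoding.
        String name = "filewithkorean\uAC00\uB9F9\uC810\uC815.txt";

        // A charset without Hangul loses the characters.
        System.out.println(roundTrip(name, StandardCharsets.ISO_8859_1));

        // UTF-8 can represent them, so the name survives unchanged.
        System.out.println(roundTrip(name, StandardCharsets.UTF_8).equals(name));
    }
}
```

If something similar happens anywhere between listing the directory and
emitting the outlinks, the original name cannot be recovered afterwards,
which would explain the 404.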

I found this and tried it, but it does not work.
https://issues.apache.org/jira/browse/NUTCH-824

I am still digging around the source code to get it solved.
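
One thing I am checking while digging is whether the outlink could be
emitted as an ASCII-only, percent-encoded URL, so that no later charset
conversion can damage it. The following is a minimal sketch of that idea
using only java.net.URI. The class FileUrlEscapeDemo and its method names
are hypothetical, and I have not confirmed that the file protocol plugin
works this way.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of Nutch.
class FileUrlEscapeDemo {

    // Build a file: URL from an absolute path and return it in US-ASCII form.
    // URI.toASCIIString() percent-encodes non-ASCII characters as UTF-8 octets.
    static String toAsciiFileUrl(String absolutePath) {
        try {
            return new URI("file", null, absolutePath, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Not a valid path: " + absolutePath, e);
        }
    }

    // Recover the original path; URI.getPath() decodes the escaped octets.
    static String decodePath(String asciiUrl) {
        return URI.create(asciiUrl).getPath();
    }

    public static void main(String[] args) {
        String path = "/C:/targetdir/filewithkorean\uAC00\uB9F9\uC810\uC815.txt";
        String url = toAsciiFileUrl(path);
        System.out.println(url);
        System.out.println(decodePath(url).equals(path));
    }
}
```

Because the encoded URL contains only ASCII characters, it should pass
through any charset unchanged, and the fetcher side can decode it back to
the real file name.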

Regards,

Ye

On Wed, Aug 29, 2012 at 9:12 PM, Lewis John Mcgibbney <
[email protected]> wrote:

> Please have a look at the discussion below
>
> http://www.mail-archive.com/[email protected]/msg04176.html
>
> It should help you out.. or point you in the correct direction at least.
>
> hth
>
> Lewis
>
> On Wed, Aug 29, 2012 at 1:13 PM, ytthet <[email protected]> wrote:
> > Hi Folks,
> >
> > I am indexing a local file system using the file-protocol plugin. I have
> > encountered an issue where the crawler is unable to fetch files whose
> > names contain CJK (non-English) characters, in my case Korean characters.
> >
> > I have the following files in my target local file system directory.
> >
> > file1.txt
> > file2.txt
> > filewithkorean가맹점정.txt
> > fileN.txt
> >
> > When I crawl, the crawler fetches only file1.txt, file2.txt and
> > fileN.txt, but not filewithkorean가맹점정.txt.
> >
> > I ran the parser checker command ./bin/nutch
> > org.apache.nutch.parse.ParserChecker file:///C:/targetdir to check the
> > outlinks extracted from the directory. The following is the result.
> >
> > Title: Index of C:\targetdir
> > Outlinks: 2
> >   outlink: toUrl: file:/C:/targetdir/file1.txt anchor: file1.txt
> >   outlink: toUrl: file:/C:/targetdir/file2.txt anchor: file2.txt
> >   outlink: toUrl: file:/C:/targetdir/filewithkorean??????.txt anchor: filewithkorean??????.txt
> >   outlink: toUrl: file:/C:/targetdir/fileN.txt anchor: fileN.txt
> > Content Metadata: Content-Length=1164 Last-Modified=Wed, 29 Aug 2012 08:47:32 GMT Content-Type=text/html
> > Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
> >
> >
> > As shown above, the Korean characters become ?????? in the outlink. Thus,
> > when the fetcher runs, it tries to fetch
> > /C:/targetdir/filewithkorean??????.txt instead of
> > /C:/targetdir/filewithkorean가맹점정.txt and hits a 404.
> >
> > My initial guess was that charset encoding detection in the parser was
> > the issue. I tried setting different encodings such as windows-1252,
> > utf-8, euc-kr and a few others, but that does not seem to fix the issue.
> >
> > Has anyone encountered a similar issue and fixed it before? I would
> > appreciate any suggestions.
> >
> > Thanks,
> >
> > Ye
> >
> >
> > --
> > View this message in context:
> http://lucene.472066.n3.nabble.com/local-file-system-crawl-unable-to-fetch-file-name-containing-CJK-letter-tp4003999.html
> > Sent from the Nutch - User mailing list archive at Nabble.com.
>
>
>
> --
> Lewis
>
