What is the current state of, and plan for, multibyte
character support in Nutch?

As far as I can tell...

The PDF plugin uses PDFBox (www.pdfbox.org), which does not
work with Japanese, and probably not with other multibyte
code sets either.

The Word plugin uses POI (http://jakarta.apache.org/poi/),
which doesn't seem to support Japanese. Some patches to
make it possible to support Japanese (and hopefully other
code sets) have been submitted to the POI project but
they have not been integrated because the project currently
has no committer.

The RTF and PowerPoint plugins use home-grown parsers.
What is the status of support for multibyte code sets
(and single-byte code sets other than ISO-8859-1) in
these plugins?

-Kuro
