What is the current state of, and plan for, multibyte character support in Nutch?
As far as I can tell:

The PDF plugin uses PDFBox (www.pdfbox.org), which does not work with Japanese and probably not with other multibyte characters and code sets either. The Word plugin uses POI (http://jakarta.apache.org/poi/), which doesn't seem to support Japanese. Some patches that would make Japanese support possible (and hopefully other code sets as well) have been submitted to the POI project, but they have not been integrated because the project currently has no committer. The RTF and PowerPoint plugins use home-grown parsers.

What is the status of support in these plugins for multibyte code sets, and for single-byte code sets other than ISO-8859-1?

-Kuro