Antoni Mylka wrote:
I was wondering if anyone has any experience with the jchardet library
for charset detection. Does it work? What kinds of documents does it
actually support?

Christiaan has posted an idea in the Aperture tracker for how we could
use jchardet to improve the plain text extractor, but it doesn't seem to
work. Or maybe the Tika guys have figured it out already and I can just
use Tika for this? :)

We started using jchardet in conjunction with cpdetector to better support Chinese, Japanese and Korean documents in our app across all Windows language variants. Without it, the app would have to fall back to the default platform encoding or a user setting whenever a UTF byte order mark was missing. jchardet seemed to do a pretty good job on the test files I used (primarily CJK and English docs). Only recently did we find out that it does not detect Cyrillic documents.
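For anyone who wants to try it, here is a minimal sketch of how we drive the detector. It follows the stream-feeding pattern from the HtmlCharsetDetector example that ships with jchardet; the fallback to the platform encoding at the end is our own policy, not something the library prescribes:

    import java.io.IOException;
    import java.io.InputStream;

    import org.mozilla.intl.chardet.nsDetector;
    import org.mozilla.intl.chardet.nsICharsetDetectionObserver;
    import org.mozilla.intl.chardet.nsPSMDetector;

    public class CharsetGuesser {

        // Guesses the charset of a stream; feeds jchardet until it is
        // certain or the stream ends.
        public static String guess(InputStream in) throws IOException {
            nsDetector det = new nsDetector(nsPSMDetector.ALL);
            final String[] found = new String[1];
            det.Init(new nsICharsetDetectionObserver() {
                public void Notify(String charset) {
                    found[0] = charset; // called once the detector is certain
                }
            });

            byte[] buf = new byte[4096];
            int len;
            boolean done = false;
            boolean ascii = true;
            while (!done && (len = in.read(buf)) > 0) {
                if (ascii) {
                    ascii = det.isAscii(buf, len);
                }
                if (!ascii) {
                    done = det.DoIt(buf, len, false);
                }
            }
            det.DataEnd(); // triggers Notify with the best guess, if any

            if (ascii) {
                return "US-ASCII";
            }
            if (found[0] != null) {
                return found[0];
            }
            // Undecided: take the first probable charset, else fall back
            // to the platform default (our policy, not jchardet's).
            String[] probable = det.getProbableCharsets();
            if (probable.length > 0 && !"nomatch".equals(probable[0])) {
                return probable[0];
            }
            return System.getProperty("file.encoding");
        }
    }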

It seems that the set of charsets supported by jchardet is a subset of those supported by Mozilla/Firefox (jchardet is supposed to be a Java port of the charset detection algorithm in those apps). Since adding further charsets is mostly a matter of porting some static data structures from C/C++ to Java, perhaps it's feasible to do that ourselves? Provided the algorithm hasn't changed, of course. I haven't had any contact with the jchardet developers yet.

When testing the Aperture test docs, only plain-text-utf16le.txt does not get processed correctly anymore, correct? That is a cpdetector problem, not a jchardet problem. We already have solid code (IMHO :) ) for BOM detection in our existing PlainTextExtractor, so there's no need to use cpdetector's ByteOrderMarkDetector.
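The checks themselves are just a handful of byte comparisons. Roughly like this (a generic sketch of the standard BOM signatures, not the actual PlainTextExtractor code; note the UTF-32 patterns must be tested before the UTF-16 ones, because FF FE is a prefix of the UTF-32LE mark):

    import java.io.IOException;
    import java.io.PushbackInputStream;

    public class BomSniffer {

        // Returns the charset implied by a leading byte order mark, or
        // null if there is none. The caller must construct the stream
        // with a pushback buffer of at least 4 bytes:
        //   new PushbackInputStream(in, 4)
        public static String sniff(PushbackInputStream in) throws IOException {
            byte[] b = new byte[4];
            int n = in.read(b, 0, 4);
            String charset = null;
            int bomLen = 0;

            if (n >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB
                       && (b[2] & 0xFF) == 0xBF) {
                charset = "UTF-8"; bomLen = 3;
            } else if (n == 4 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE
                       && b[2] == 0 && b[3] == 0) {
                charset = "UTF-32LE"; bomLen = 4;
            } else if (n == 4 && b[0] == 0 && b[1] == 0
                       && (b[2] & 0xFF) == 0xFE && (b[3] & 0xFF) == 0xFF) {
                charset = "UTF-32BE"; bomLen = 4;
            } else if (n >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
                charset = "UTF-16LE"; bomLen = 2;
            } else if (n >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
                charset = "UTF-16BE"; bomLen = 2;
            }

            // Push back whatever was read beyond the BOM so the caller
            // sees the document content from the first real byte.
            if (n > bomLen) {
                in.unread(b, bomLen, n - bomLen);
            }
            return charset;
        }
    }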


Regards,

Chris