Antoni Mylka wrote:
I was wondering if anyone has any experience with the jchardet library
for charset detection. Does it work? What kinds of documents does it
actually support?
Christiaan has posted an idea on the Aperture tracker about how we could
use jchardet to improve the plain text extractor, but it doesn't seem to
work. Or maybe the Tika guys have already figured it out and I can just
use Tika for this? :)
We started using jchardet in conjunction with cpdetector to better
support Chinese, Japanese and Korean documents in our app on all Windows
language variants. Without it, the app would have to fall back to the
default platform encoding or a user setting whenever a UTF Byte Order
Mark was missing. jchardet seemed to do a pretty good job on the test
files that I used (primarily CJK and English docs). Only recently did we
find out that it does not detect Cyrillic documents.
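For what it's worth, using jchardet directly follows the pattern below.
This is a minimal sketch along the lines of the HtmlCharsetDetector
example that ships with the library; the detectCharset wrapper is mine,
not part of jchardet's API:

  import java.io.IOException;
  import java.io.InputStream;
  import org.mozilla.intl.chardet.nsDetector;
  import org.mozilla.intl.chardet.nsICharsetDetectionObserver;
  import org.mozilla.intl.chardet.nsPSMDetector;

  // Feeds a stream to jchardet and returns its best guess, or null if
  // it cannot decide (as happens with the Cyrillic documents above).
  static String detectCharset(InputStream in) throws IOException {
      final String[] found = new String[1];
      nsDetector det = new nsDetector(nsPSMDetector.ALL);
      det.Init(new nsICharsetDetectionObserver() {
          public void Notify(String charset) {
              found[0] = charset; // called when jchardet is confident
          }
      });
      byte[] buf = new byte[1024];
      int len;
      boolean isAscii = true;
      boolean done = false;
      while ((len = in.read(buf)) > 0) {
          if (isAscii)
              isAscii = det.isAscii(buf, len);
          if (!isAscii && !done)
              done = det.DoIt(buf, len, false); // false = more data follows
      }
      det.DataEnd(); // triggers Notify() if a charset was recognized
      return isAscii ? "ASCII" : found[0];
  }

cpdetector essentially wraps this behind its JChardetFacade, so the
detection quality is the same either way.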
It seems that the set of charsets supported by jchardet is a subset of
those supported by Mozilla/Firefox (jchardet is supposed to be a Java
port of the charset detection algorithm in those apps). Since adding
charsets is mostly a matter of porting some static data structures from
C or C++ to Java, perhaps it's feasible to do that ourselves? Provided
that the algorithm hasn't changed, of course. I have not had any contact
with the jchardet developers yet.
When testing against the Aperture test docs, only plain-text-utf16le.txt
no longer gets processed correctly, right? That is a cpdetector problem,
not a jchardet problem. We already have solid code (IMHO :) ) for BOM
detection in our existing PlainTextExtractor, so there is no need to use
cpdetector's ByteOrderMarkDetector.
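For reference, BOM detection boils down to a few byte comparisons. A
sketch of the idea (not the actual PlainTextExtractor code):

  // Returns the charset implied by a Byte Order Mark at the start of
  // the buffer, or null if no BOM is present. The UTF-32 checks must
  // come before the UTF-16 ones, since their BOMs share a prefix.
  static String charsetFromBOM(byte[] b, int len) {
      if (len >= 4 && b[0] == (byte) 0xFF && b[1] == (byte) 0xFE
                   && b[2] == 0x00 && b[3] == 0x00)
          return "UTF-32LE";
      if (len >= 4 && b[0] == 0x00 && b[1] == 0x00
                   && b[2] == (byte) 0xFE && b[3] == (byte) 0xFF)
          return "UTF-32BE";
      if (len >= 3 && b[0] == (byte) 0xEF && b[1] == (byte) 0xBB
                   && b[2] == (byte) 0xBF)
          return "UTF-8";
      if (len >= 2 && b[0] == (byte) 0xFF && b[1] == (byte) 0xFE)
          return "UTF-16LE";
      if (len >= 2 && b[0] == (byte) 0xFE && b[1] == (byte) 0xFF)
          return "UTF-16BE";
      return null;
  }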
Regards,
Chris