Antoni Mylka wrote:
I was wondering if anyone has any experience with the jchardet library
for charset detection. Does it work? What kinds of documents does it
actually support?
Christiaan has posted an idea on the Aperture tracker about how we could
use jchardet to improve the plain text extractor, but it doesn't seem to
work. Or maybe the Tika guys have already figured it out and I can just
use Tika for this? :)
We started using jchardet in conjunction with cpdetector to better
support Chinese, Japanese and Korean documents in our app on all Windows
language variants. Without it, the app would have to fall back to the
default platform encoding or a user setting whenever a UTF Byte Order
Mark was missing. jchardet seemed to do a pretty good job on the test
files that I used (primarily CJK and English docs). Only recently did we
find out that it does not detect Cyrillic documents.
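For what it's worth, using jchardet directly follows the pattern below.
This is a minimal sketch along the lines of the HtmlCharsetDetector
example that ships with the library; the detectCharset wrapper is mine,
not part of jchardet's API:

  import java.io.IOException;
  import java.io.InputStream;
  import org.mozilla.intl.chardet.nsDetector;
  import org.mozilla.intl.chardet.nsICharsetDetectionObserver;
  import org.mozilla.intl.chardet.nsPSMDetector;

  // Feeds a stream to jchardet and returns its best guess, or null if
  // it cannot decide (as happens with the Cyrillic documents above).
  static String detectCharset(InputStream in) throws IOException {
      final String[] found = new String[1];
      nsDetector det = new nsDetector(nsPSMDetector.ALL);
      det.Init(new nsICharsetDetectionObserver() {
          public void Notify(String charset) {
              found[0] = charset; // called when jchardet is confident
          }
      });
      byte[] buf = new byte[1024];
      int len;
      boolean isAscii = true;
      boolean done = false;
      while ((len = in.read(buf)) > 0) {
          if (isAscii)
              isAscii = det.isAscii(buf, len);
          if (!isAscii && !done)
              done = det.DoIt(buf, len, false); // false = more data follows
      }
      det.DataEnd(); // triggers Notify() if a charset was recognized
      return isAscii ? "ASCII" : found[0];
  }

cpdetector essentially wraps this behind its JChardetFacade, so the
detection quality is the same either way.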
It seems that the set of charsets supported by jchardet is a subset of
those supported by Mozilla/Firefox (jchardet is supposed to be a Java
port of the charset detection algorithm in those apps). Since adding
charsets is mostly a matter of porting some static data structures from
C or C++ to Java, perhaps it's feasible to do that ourselves? Provided
that the algorithm hasn't changed, of course. I have not had any contact
with the jchardet developers yet.
When testing against the Aperture test docs, only plain-text-utf16le.txt
no longer gets processed correctly, right? That is a cpdetector problem,
not a jchardet problem. We already have solid code (IMHO :) ) for BOM
detection in our existing PlainTextExtractor, so there is no need to use
cpdetector's ByteOrderMarkDetector.
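For reference, BOM detection boils down to a few byte comparisons. A
sketch of the idea (not the actual PlainTextExtractor code):

  // Returns the charset implied by a Byte Order Mark at the start of
  // the buffer, or null if no BOM is present. The UTF-32 checks must
  // come before the UTF-16 ones, since their BOMs share a prefix.
  static String charsetFromBOM(byte[] b, int len) {
      if (len >= 4 && b[0] == (byte) 0xFF && b[1] == (byte) 0xFE
                   && b[2] == 0x00 && b[3] == 0x00)
          return "UTF-32LE";
      if (len >= 4 && b[0] == 0x00 && b[1] == 0x00
                   && b[2] == (byte) 0xFE && b[3] == (byte) 0xFF)
          return "UTF-32BE";
      if (len >= 3 && b[0] == (byte) 0xEF && b[1] == (byte) 0xBB
                   && b[2] == (byte) 0xBF)
          return "UTF-8";
      if (len >= 2 && b[0] == (byte) 0xFF && b[1] == (byte) 0xFE)
          return "UTF-16LE";
      if (len >= 2 && b[0] == (byte) 0xFE && b[1] == (byte) 0xFF)
          return "UTF-16BE";
      return null;
  }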
Regards,
Chris