Charset detection

Public Network Services Tue, 24 Jul 2012 16:05:43 -0700

The CHANGES.txt file of Tika 1.2 states that

*Tika now returns the detected character encoding as*
*a "charset" parameter of the content type metadata field for text/plain*
*and text/html documents. For example, instead of just "text/plain", the*
*returned content type will be something like "text/plain; charset=UTF-8"*
*for a UTF-8 encoded text document.*



However, when I run detection on a set of plain-text (ASCII) files (some
IETF RFCs), the returned content type is still just "text/plain", without
any charset information.

The code I am using to detect the content type of each file is something like:

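// Needs org.apache.tika.Tika, org.apache.tika.io.TikaInputStream, java.io.*
Tika tika = new Tika();
try (InputStream is = TikaInputStream.get(new FileInputStream(file))) {
    // Prints the detected content type of the file
    System.out.println(tika.detect(is));
}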


and the output is still "text/plain", as in previous versions of Tika.

Should that be the case?
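
To check a batch of files for the charset parameter described in CHANGES.txt, the returned content type string can be inspected directly. The following is only a minimal sketch in plain Java with no Tika dependency; the class name ContentTypeCharset and the method charsetOf are hypothetical helpers, not part of the Tika API, and the sketch assumes the "type/subtype; name=value" form shown in the CHANGES.txt example.

```java
import java.util.Locale;

// Hypothetical helper, not part of Tika: extracts the charset parameter
// from a content type string such as "text/plain; charset=UTF-8".
public class ContentTypeCharset {

    // Returns the charset value, or null when no charset parameter is present.
    public static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type; the remaining parts are name=value parameters.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq > 0) {
                String name = param.substring(0, eq).trim().toLowerCase(Locale.ROOT);
                if (name.equals("charset")) {
                    String value = param.substring(eq + 1).trim();
                    // Strip optional surrounding double quotes.
                    if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                        value = value.substring(1, value.length() - 1);
                    }
                    return value;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/plain; charset=UTF-8")); // prints UTF-8
        System.out.println(charsetOf("text/plain"));                // prints null
    }
}
```

With the output I am seeing, charsetOf would return null for every file, since the string is just "text/plain".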
