Hi,

On Sun, Sep 2, 2012 at 2:01 PM, Benson Margulies <[email protected]> wrote:
> It has been working fine on many inputs, but I get no text in the
> content handler when I feed it a file in the Shift-JIS encoding.

The text detector in Tika doesn't have a reliable way to detect
Shift-JIS, which is why you're seeing the default
application/octet-stream type. AFAIK there is no good way to reliably
detect Shift-JIS by looking only at the incoming byte stream.
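
To illustrate part of the difficulty, here is a minimal sketch using only the JDK (no Tika involved; the class and method names are made up for this example). It assumes the JDK ships the Shift_JIS charset, which standard builds do. The same bytes decode cleanly both as Shift_JIS and as ISO-8859-1, so checking whether the bytes are valid in a given encoding is not enough to tell which one was intended.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class ShiftJisAmbiguity {
    // First five characters of the sample text shown later in this message,
    // written as Unicode escapes so the source file encoding does not matter.
    static final String SAMPLE = "\u96fb\u5b50\u5546\u53d6\u5f15";

    // Decode strictly: return null instead of silently replacing bad input.
    static String strictDecode(byte[] bytes, Charset charset) {
        try {
            return charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        Charset sjis = Charset.forName("Shift_JIS");
        byte[] bytes = SAMPLE.getBytes(sjis);

        // Valid as Shift_JIS, and round-trips to the original text.
        System.out.println("Shift_JIS ok:  " + SAMPLE.equals(strictDecode(bytes, sjis)));
        // Also "valid" as ISO-8859-1, although the result is not the intended text.
        System.out.println("ISO-8859-1 ok: " + (strictDecode(bytes, StandardCharsets.ISO_8859_1) != null));
        // Not valid as UTF-8, so at least that candidate can be ruled out.
        System.out.println("UTF-8 ok:      " + (strictDecode(bytes, StandardCharsets.UTF_8) != null));
    }
}
```

A real detector has to fall back on statistics or outside hints to pick between the candidates that survive such a check.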

If you already know that you're dealing with text, you can give Tika a
media type hint of "text/plain" or even "text/plain;
charset=Shift_JIS" as input metadata along with the document to be
parsed. That should help Tika determine how to parse the document.
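
To show what such a hint makes possible, here is a rough JDK-only sketch. It is not Tika's implementation, and the class and method names are hypothetical. It pulls the charset parameter out of a media type string and uses it to decode the bytes, falling back to a default when no usable charset is given.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetHintDemo {
    private static final String PARAM = "charset=";

    // Return the charset named in a hint such as "text/plain; charset=Shift_JIS",
    // or the fallback if the hint has no charset parameter or names an unknown one.
    static Charset charsetFromHint(String mediaType, Charset fallback) {
        for (String part : mediaType.split(";")) {
            String p = part.trim();
            if (p.regionMatches(true, 0, PARAM, 0, PARAM.length())) {
                String name = p.substring(PARAM.length()).trim();
                try {
                    if (Charset.isSupported(name)) {
                        return Charset.forName(name);
                    }
                } catch (IllegalArgumentException e) {
                    // Empty or syntactically illegal charset name: ignore the hint.
                }
            }
        }
        return fallback;
    }

    // Decode the document bytes using the charset from the hint, if any.
    static String decodeWithHint(byte[] bytes, String mediaType) {
        return new String(bytes, charsetFromHint(mediaType, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String original = "\u96fb\u5b50\u5546\u53d6\u5f15";
        byte[] bytes = original.getBytes(Charset.forName("Shift_JIS"));

        // With the charset in the hint the text comes back intact.
        System.out.println(original.equals(decodeWithHint(bytes, "text/plain; charset=Shift_JIS")));
        // Without it, this sketch falls back to UTF-8 and the text is garbled.
        System.out.println(original.equals(decodeWithHint(bytes, "text/plain")));
    }
}
```

The UTF-8 fallback is just a choice made for this sketch; a real parser would more likely run its own encoding detection when the hint carries no charset.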

For example, using the Shift-JIS file from
https://issues.alfresco.com/jira/browse/ALF-15233 we get the
following:

$ java -jar tika-app.jar --detect < shiftjs.txt # look only at the byte stream
application/octet-stream

$ java -jar tika-app.jar --detect shiftjs.txt # give the file name with .txt ending as a type hint
text/plain

$ java -jar tika-app.jar --text shiftjs.txt # check that the encoding is correctly detected
電子商取引(エレクトロニックコマース)、オンライン [...]

Yes!

BR,

Jukka Zitting
