Hi,

On Sun, Sep 2, 2012 at 2:01 PM, Benson Margulies <[email protected]> wrote:
> It has been working fine on many inputs, but I get no text in the
> content handler when I feed it a file in the Shift-JIS encoding.

The text detector in Tika doesn't have a reliable way to detect Shift-JIS, which is why you're seeing the default application/octet-stream type. AFAIK there is no good way to reliably detect Shift-JIS by looking only at the incoming byte stream.

If you already know that you're dealing with text, you can give Tika a media type hint of "text/plain" or even "text/plain; charset=Shift_JIS" as input metadata along with the document to be parsed. That should help Tika determine how to parse the document.

For example, using the Shift-JIS file from https://issues.alfresco.com/jira/browse/ALF-15233 we get the following:

    $ java -jar tika-app.jar --detect < shiftjs.txt  # look only at the byte stream
    application/octet-stream
    $ java -jar tika-app.jar --detect shiftjs.txt  # give the file name with .txt ending as a type hint
    text/plain
    $ java -jar tika-app.jar --text shiftjs.txt  # check that the encoding is correctly detected
    電子商取引(エレクトロニックコマース)、オンライン [...]

Yes!

BR,

Jukka Zitting
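
To illustrate why the byte stream alone is ambiguous and why an explicit charset hint helps, here is a minimal sketch that uses only the JDK's java.nio.charset classes, not Tika's API. The class and method names (CharsetHintDemo, strictDecode, sampleBytes) are made up for this example, and it assumes a JDK that ships the Shift_JIS charset (standard full JDK builds do). The same bytes fail to decode as UTF-8 but decode without error as both Shift_JIS and ISO-8859-1, so validity alone cannot tell the last two apart.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

class CharsetHintDemo {

    // Two kanji (the first two characters of the sample output above),
    // written as Unicode escapes so the source file encoding does not matter.
    static final String SAMPLE = "\u96FB\u5B50";

    // Encode the sample text as Shift_JIS, standing in for the input file.
    static byte[] sampleBytes() {
        return SAMPLE.getBytes(Charset.forName("Shift_JIS"));
    }

    // Decode strictly: return null if the bytes are not valid in the
    // given charset, instead of silently substituting U+FFFD.
    static String strictDecode(byte[] bytes, String charsetName) {
        try {
            return Charset.forName(charsetName)
                    .newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        byte[] bytes = sampleBytes();
        for (String name : new String[] {"UTF-8", "ISO-8859-1", "Shift_JIS"}) {
            String text = strictDecode(bytes, name);
            System.out.println(name + ": "
                    + (text == null ? "invalid byte sequence"
                                    : "valid, " + text.length() + " chars, "
                                      + (SAMPLE.equals(text) ? "correct text" : "wrong text")));
        }
    }
}
```

Only the caller knows which of the valid decodings is the intended one, which is the information a "charset=Shift_JIS" hint carries.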
