[ https://issues.apache.org/jira/browse/TIKA-154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661997#action_12661997 ]
Andrzej Rusin commented on TIKA-154:
------------------------------------

I implemented a simple, perhaps somewhat naive, but working check for text files:

    public boolean isProperFile(File file, String mimeTypeName) throws IOException {
        // we check only text types here
        if (!mimeTypeName.startsWith("text"))
            return true;
        Perl5Util util = new Perl5Util();
        byte[] data = getFileSample(file);
        if (data == null) // empty file, can be assumed to be text
            return true;
        String s = new String(data, "UTF-8");
        // any character that is neither ASCII nor whitespace marks the file as not proper text
        if (!util.match("/[^[:ascii:][:space:]]/", s)) {
            return true;
        }
        return false;
    }

    protected byte[] getFileSample(File file) throws IOException {
        byte[] data = new byte[SAMPLE_SIZE];
        FileInputStream fs = null;
        try {
            fs = new FileInputStream(file);
            int read = fs.read(data);
            if (read < 0)
                return null;
            data = Arrays.copyOfRange(data, 0, read);
        } finally {
            if (fs != null)
                fs.close();
        }
        return data;
    }

> Better detection of plain text versus binary formats with a text header
> -----------------------------------------------------------------------
>
>                 Key: TIKA-154
>                 URL: https://issues.apache.org/jira/browse/TIKA-154
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime
>            Reporter: Jukka Zitting
>            Priority: Minor
>
> Antoni Mylka noted on the mailing list:
>     Many binary formats begin with magic byte sequences composed of ASCII
>     characters, e.g.
>     zipfiles begin with PK
>     pdfs begin with %PDF-
>     chms help files begin with ITSF
>     etc.
> Tika should do a better job of detecting such cases.
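
For comparison, the same kind of sample-based check can be sketched without the Perl5Util (Jakarta ORO) dependency. This is only an illustrative sketch, not Tika code: the class name TextSampleCheck, the method names, and the SAMPLE_SIZE value of 1024 are assumptions (the comment above does not give a value). It also differs from the posted code in two ways: it rejects samples that start with one of the magic prefixes named in the issue, and it rejects control bytes instead of non-ASCII bytes, so UTF-8 text is not flagged.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

final class TextSampleCheck {

    // Hypothetical sample size; the original comment does not state one.
    static final int SAMPLE_SIZE = 1024;

    // ASCII magic prefixes named in the issue description (zip, PDF, CHM).
    private static final byte[][] MAGIC_PREFIXES = {
        "PK".getBytes(StandardCharsets.US_ASCII),
        "%PDF-".getBytes(StandardCharsets.US_ASCII),
        "ITSF".getBytes(StandardCharsets.US_ASCII),
    };

    // Reads up to SAMPLE_SIZE bytes from the start of the file.
    // Returns an empty array for an empty file.
    static byte[] readSample(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[SAMPLE_SIZE];
            int read = in.readNBytes(buffer, 0, buffer.length);
            return Arrays.copyOf(buffer, read);
        }
    }

    // Returns true if the sample looks like plain text.
    static boolean looksLikeText(byte[] sample) {
        if (sample.length == 0) {
            return true; // empty file: treated as text, as in the original comment
        }
        for (byte[] magic : MAGIC_PREFIXES) {
            if (startsWith(sample, magic)) {
                return false; // binary format with an ASCII header
            }
        }
        for (byte b : sample) {
            int c = b & 0xFF;
            boolean whitespace = c == '\t' || c == '\n' || c == '\r' || c == '\f';
            if ((c < 0x20 && !whitespace) || c == 0x7F) {
                return false; // control byte, unlikely in plain text
            }
        }
        return true;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A caller would combine the two methods as TextSampleCheck.looksLikeText(TextSampleCheck.readSample(path)). The prefix list is a blunt instrument: a genuine text file that happens to begin with "PK" would be reported as binary, so a real detector would likely match longer signatures.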