[ 
https://issues.apache.org/jira/browse/TIKA-154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661997#action_12661997
 ] 

Andrzej Rusin commented on TIKA-154:
------------------------------------

I implemented a simple, perhaps somewhat naive, but working check for text 
files:

        public boolean isProperFile(File file, String mimeTypeName) throws IOException {
                
                //we check only text types here
                if (!mimeTypeName.startsWith("text"))
                        return true;
                
                Perl5Util util = new Perl5Util();
                byte[] data = getFileSample(file);
                
                if (data == null)
                        //empty file, so we can treat it as text
                        return true;
                
                String s = new String(data, "UTF-8");
                //proper text only if the sample has no character that is
                //both non-ASCII and non-whitespace
                return !util.match("/[^[:ascii:][:space:]]/", s);
        }

        protected byte[] getFileSample(File file) throws IOException {
                byte[] data = new byte[SAMPLE_SIZE];
                FileInputStream fs = null;
                try {
                        fs = new FileInputStream(file);
                        int read = fs.read(data);
                                                
                        if (read < 0)
                                return null;
                        
                        data = Arrays.copyOfRange(data, 0, read);
                } finally {
                        if (fs != null)
                                fs.close();     
                }
                return data;
        }
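
For comparison, here is a minimal JDK-only sketch of the same idea, without 
the Perl5Util dependency. It is an approximation, not a drop-in replacement: 
the class and method names are hypothetical, SAMPLE_SIZE is an assumed value 
(the original constant is not shown), and Character.isWhitespace only 
approximates the [:space:] class used in the regex above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper class, for illustration only.
final class TextSampleCheck {

    // Assumed sample size; the value used in the original code is not shown.
    static final int SAMPLE_SIZE = 1024;

    // Reads up to SAMPLE_SIZE bytes from the start of the file.
    // Returns an empty array for an empty file.
    static byte[] readSample(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[SAMPLE_SIZE];
            int read = in.read(buf);
            return read <= 0 ? new byte[0] : Arrays.copyOf(buf, read);
        }
    }

    // Returns true when the decoded sample contains no character that is
    // both non-ASCII and non-whitespace. Malformed UTF-8 is decoded to
    // U+FFFD by this String constructor, so it counts as non-ASCII.
    static boolean looksLikeAsciiText(byte[] sample) {
        String s = new String(sample, StandardCharsets.UTF_8);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c > 0x7F && !Character.isWhitespace(c)) {
                return false;
            }
        }
        return true;
    }
}
```

Two properties of this sketch are worth noting. ASCII control bytes are 
accepted, so a sample consisting of 'P', 'K', 0x03, 0x04 still passes, and a 
valid UTF-8 sample containing an accented letter is rejected. Whether either 
is acceptable depends on what the check is meant to filter out.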

> Better detection of plain text versus binary formats with a text header
> -----------------------------------------------------------------------
>
>                 Key: TIKA-154
>                 URL: https://issues.apache.org/jira/browse/TIKA-154
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime
>            Reporter: Jukka Zitting
>            Priority: Minor
>
> Antoni Mylka noted on the mailing list:
>     Many binary formats begin with magic byte sequences composed of ASCII 
> characters, e.g.
>     zipfiles begin with PK
>     pdfs begin with %PDF-
>     chms help files begin with ITSF
>     etc.
> Tika should do a better job of detecting such cases.
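
The cases listed in the issue description could be checked before falling 
back to a text type. The following is only a rough sketch under that 
assumption, not Tika's actual detection logic: the class name, method name, 
and returned labels are hypothetical, and only the three prefixes quoted 
above are included.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class, for illustration only.
final class MagicPrefixCheck {

    // The three ASCII prefixes named in the issue description.
    // The labels on the right are illustrative, not MIME type names.
    private static final Map<String, String> MAGIC = new LinkedHashMap<>();
    static {
        MAGIC.put("PK", "zip");
        MAGIC.put("%PDF-", "pdf");
        MAGIC.put("ITSF", "chm");
    }

    // Returns the label of the first prefix the sample starts with,
    // or null when none of the known prefixes matches.
    static String matchMagic(byte[] sample) {
        for (Map.Entry<String, String> e : MAGIC.entrySet()) {
            byte[] magic = e.getKey().getBytes(StandardCharsets.US_ASCII);
            if (startsWith(sample, magic)) {
                return e.getValue();
            }
        }
        return null;
    }

    // Byte-wise prefix comparison.
    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A prefix match alone is not conclusive, since a genuine text file can also 
begin with "PK". A non-null result is better treated as a hint to be combined 
with other checks than as a final answer.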

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
