Re: Wrong Mime-type detection

Nick Burch Mon, 04 Apr 2011 09:22:54 -0700

On Mon, 4 Apr 2011, Markus Jelsma wrote:

I've got some OCR'd books in plain text format which are incorrectly marked as application/x-elc, probably because of the junk bytes in the head and sometimes tail. I've also seen some being marked as shockwave files. The additional problem is that the file program also marks these files as Lisp data.

One option is that if you trust the filename, you could match on just that. For example, you could decide that http://ia600400.us.archive.org/ can be trusted to get the content type correct on text files, and just use theirs.
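As a rough sketch of the filename-only approach, assuming Python and its standard mimetypes module rather than whatever detector you are actually running (the function name type_from_name is made up for illustration):

```python
import mimetypes


def type_from_name(name, default="application/octet-stream"):
    """Guess a media type from the file name alone, ignoring the content.

    Hypothetical helper: it only consults the extension table, so junk
    bytes at the head or tail of the file cannot influence the result.
    """
    guessed, _encoding = mimetypes.guess_type(name)
    return guessed or default
```

This only helps if the names really are trustworthy; a file with no extension simply falls through to the default.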

Another one that could work in your special case could be to detect on both the start of the file, and say 10% and 20% in. If you get octet stream for the 10% and 20% then you know the first detection is likely to be correct. If they both give text, then there's a fair chance it's actually one of your iffy text files.
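A minimal sketch of that sampling idea, again in Python. Everything here is an assumption for illustration: naive_detect is a toy stand-in for your real detector, the 4096 byte window is arbitrary, and keeping the head result when the samples disagree is just one possible policy.

```python
def naive_detect(chunk):
    """Toy stand-in for a real detector; swap in your own."""
    if chunk.startswith(b";ELC"):
        return "application/x-elc"
    if not chunk:
        return "application/octet-stream"
    # Count printable ASCII plus tab, newline and carriage return.
    printable = sum(1 for b in chunk if 32 <= b <= 126 or b in (9, 10, 13))
    if printable / len(chunk) >= 0.95:
        return "text/plain"
    return "application/octet-stream"


def detect_with_samples(data, detect=naive_detect,
                        fractions=(0.10, 0.20), window=4096):
    """Detect on the head, then re-check at offsets further into the file."""
    head_type = detect(data[:window])
    sample_types = []
    for fraction in fractions:
        start = int(len(data) * fraction)
        sample_types.append(detect(data[start:start + window]))
    # If every sample inside the file looks like text, distrust the head.
    if all(t.startswith("text/") for t in sample_types):
        return "text/plain"
    # Otherwise (binary or mixed samples) keep the first detection.
    return head_type
```

For example, a buffer that starts with junk resembling an .elc header but is followed by ordinary text comes back as text/plain, while a buffer that is binary throughout keeps whatever the head detection said.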

Finally, if you know none of your files will be of certain problematic types, you could try just removing them from your mime magic list?
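To illustrate the effect of pruning the magic list, here is a small Python sketch. The MAGIC table, its prefixes and both function names are simplified assumptions for illustration, not the format of any real magic file; in practice you would edit whatever magic definitions your detector loads.

```python
# Hypothetical, heavily simplified magic table: (leading bytes, media type).
MAGIC = [
    (b";ELC", "application/x-elc"),
    (b"FWS", "application/x-shockwave-flash"),
    (b"%PDF-", "application/pdf"),
]


def without_types(magic, excluded):
    """Return a copy of the magic table with the unwanted types dropped."""
    return [(prefix, mtype) for prefix, mtype in magic if mtype not in excluded]


def detect_by_magic(chunk, magic, default="application/octet-stream"):
    """Return the first type whose prefix matches, else the default."""
    for prefix, mtype in magic:
        if chunk.startswith(prefix):
            return mtype
    return default
```

With application/x-elc and application/x-shockwave-flash excluded, a text file with junk leading bytes no longer matches those entries and falls through to whatever your default or later text detection decides.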


Nick
