On Mon, 4 Apr 2011, Markus Jelsma wrote:
I've got some OCR'd books in plain text format which are incorrectly marked as application/x-elc, probably because of the junk bytes in the head and sometimes tail. I've also seen some being marked as shockwave files. The additional problem is that the file program also marks these files as Lisp data.

One option is that if you trust the filename, you could match on just that. For example, you could decide that http://ia600400.us.archive.org/ can be trusted to get the content type correct on text files, and just use theirs.
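As a rough sketch of that first option, the snippet below prefers the server-declared type over the detected one, but only for hosts on an allow-list. The names `TRUSTED_HOSTS` and `pick_type` are made up for illustration, and the rule "only trust `text/*` declarations" is an assumption, not something any particular library does for you.

```python
from urllib.parse import urlparse

# Hypothetical allow-list: hosts whose Content-Type header we choose to trust.
TRUSTED_HOSTS = {"ia600400.us.archive.org"}


def pick_type(url, server_type, detected_type):
    """Prefer the server-declared type for text files from trusted hosts."""
    host = urlparse(url).hostname or ""
    if host in TRUSTED_HOSTS and server_type and server_type.startswith("text/"):
        # Drop any parameters such as "; charset=UTF-8".
        return server_type.split(";", 1)[0].strip()
    # Otherwise keep whatever the magic-based detection said.
    return detected_type
```

For any host not on the list, the detected type is returned unchanged, so the override stays narrow.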

Another option that could work in your special case is to run detection not only on the start of the file, but also on samples taken at, say, 10% and 20% of the way in. If the 10% and 20% samples both come back as octet stream, then you know the first detection is likely to be correct. If they both come back as text, then there's a fair chance it's actually one of your iffy text files.
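The sampling idea could look something like the sketch below. Note that `looks_like_text` is only a crude printable-byte heuristic standing in for a real detector, and the 512-byte sample size and 95% threshold are arbitrary assumptions. The mixed case (one sample text, one not) isn't covered above; this sketch simply keeps the head detection there.

```python
def looks_like_text(chunk):
    """Crude stand-in for a real detector: mostly printable bytes means text."""
    if not chunk:
        return False
    printable = sum(
        1 for b in chunk
        if b in (9, 10, 13) or 32 <= b <= 126 or b >= 128
    )
    return printable / len(chunk) > 0.95  # threshold is an assumption


def sample(data, fraction, size=512):
    """Return up to `size` bytes starting `fraction` of the way into data."""
    start = int(len(data) * fraction)
    return data[start:start + size]


def recheck(data, head_type):
    """Override the head detection only if both inner samples look like text."""
    inner = [sample(data, 0.10), sample(data, 0.20)]
    if all(looks_like_text(s) for s in inner):
        return "text/plain"
    # Inner samples look binary (or disagree): keep the first detection.
    return head_type
```

You'd call `recheck(data, head_type)` only when the head detection gives one of the suspicious types, so well-behaved files don't pay for the extra reads.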

Finally, if you know none of your files will be of certain problematic types, you could try just removing those types from your mime magic list.
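To illustrate the effect of dropping entries, here is a toy prefix-matching detector with an exclusion set. The `MAGIC` table is purely illustrative (the patterns are simplified and not copied from any real magic file), and `detect` and `EXCLUDED` are hypothetical names; with a real detector you would edit its magic definitions instead.

```python
# Toy magic table: (mime type, byte prefix). Patterns are illustrative only.
MAGIC = [
    ("application/pdf", b"%PDF-"),
    ("application/x-shockwave-flash", b"FWS"),
    ("application/x-elc", b"\n("),
]

# Types we are sure never occur in this collection (an assumption).
EXCLUDED = {"application/x-elc", "application/x-shockwave-flash"}


def detect(head, magic=MAGIC, excluded=frozenset()):
    """Return the first non-excluded type whose prefix matches the head."""
    for mime, prefix in magic:
        if mime in excluded:
            continue  # behave as if this entry had been removed from the list
        if head.startswith(prefix):
            return mime
    return "application/octet-stream"
```

With the exclusions in place, a file that used to trip one of the removed patterns falls through to the default, where a text check can pick it up.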

Nick
