Re: Wrong Mime-type detection

Markus Jelsma Wed, 06 Apr 2011 06:53:47 -0700

Hi Nick,

Tika is deployed inside my Nutch setup, and adjusting parse-tika for the first two suggestions is not a simple task. However, removing all types other than text/plain from tika-mimetypes might do the trick. Will Tika then fall back to that type even if it doesn't detect the file as such at first?


Thanks,

On Mon, 4 Apr 2011 17:22:20 +0100 (BST), Nick Burch <[email protected]> wrote:

On Mon, 4 Apr 2011, Markus Jelsma wrote:
I've got some OCR'd books in plain text format which are incorrectly marked as application/x-elc, probably because of the junk bytes in the head and sometimes tail. I've also seen some being marked as shockwave files. The additional problem is that the file program also marks these files as Lisp data.
One option is that if you trust the filename, you could match on just
that. For example, you could decide that
http://ia600400.us.archive.org/ can be trusted to get the content type
correct on text files, and just use theirs.
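
As a rough sketch of that idea (not tied to Tika or Nutch; the function name, the allow-list, and the way the two types are passed in are all assumptions made for illustration), the decision could look like this:

```python
from urllib.parse import urlparse

# Hypothetical allow-list of hosts whose Content-Type header is trusted
# for text files. The single entry is the host named in the message.
TRUSTED_TEXT_HOSTS = {"ia600400.us.archive.org"}


def choose_type(url, server_type, detected_type):
    """Prefer the server's declared type for text files from trusted hosts.

    `server_type` is the raw Content-Type header value (may be None);
    `detected_type` is whatever magic-based detection returned.
    """
    host = urlparse(url).hostname
    # Drop parameters such as "; charset=UTF-8" before comparing.
    declared = (server_type or "").split(";")[0].strip().lower()
    if host in TRUSTED_TEXT_HOSTS and declared.startswith("text/"):
        return declared
    # Otherwise keep the magic-based result.
    return detected_type
```

The server's answer only wins when it already claims a text type, so a trusted host serving real binaries would still go through normal detection.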

Another one that could work in your special case could be to detect
on both the start of the file, and say 10% and 20% in. If you get
octet stream for the 10% and 20% then you know the first detection is
likely to be correct. If they both give text, then there's a fair
chance it's actually one of your iffy text files.
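
A minimal sketch of that sampling approach, assuming the whole file is available as bytes and that `detect` is some callable mapping a byte sample to a MIME type string (a stand-in for whatever detector is actually in use; the function name, sample size, and fallback to text/plain are illustrative choices, not anything Tika provides):

```python
def detect_with_samples(data, detect, sample_size=8192, fractions=(0.10, 0.20)):
    """Detect on the head of `data`, then re-check at later offsets.

    `detect` is any callable taking bytes and returning a MIME type string.
    """
    head_type = detect(data[:sample_size])
    if head_type.startswith("text/"):
        # The head already looks like text; nothing to second-guess.
        return head_type

    later_types = []
    for fraction in fractions:
        start = int(len(data) * fraction)
        later_types.append(detect(data[start:start + sample_size]))

    # If every later sample looks like text, the head result was probably
    # triggered by junk bytes, so treat the file as plain text.
    if later_types and all(t.startswith("text/") for t in later_types):
        return "text/plain"

    # Later samples look binary too, so the first detection stands.
    return head_type
```

Samples taken from the middle of a file will not match offset-0 magic, so a detector will normally report either a text type or application/octet-stream for them, which is exactly the distinction needed here.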

Finally, if you know none of your files will be of certain
problematic types, you could try just removing them from your mime
magic list?
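
For illustration only, here is one way such pruning could be scripted, assuming the type definitions live in an XML file shaped like a `<mime-info>` root with `<mime-type type="...">` children and no XML namespace (check the real file before relying on this; `drop_types` is a hypothetical helper name):

```python
import xml.etree.ElementTree as ET


def drop_types(xml_text, unwanted):
    """Return the XML with every <mime-type> whose type is in `unwanted` removed."""
    root = ET.fromstring(xml_text)
    # Iterate over a copy of the list so removal is safe during the loop.
    for node in list(root.findall("mime-type")):
        if node.get("type") in unwanted:
            root.remove(node)
    return ET.tostring(root, encoding="unicode")
```

Calling `drop_types(xml_text, {"application/x-elc"})` would remove only that definition, magic rules included. Note that ElementTree discards XML comments by default, so the rewritten file will lose any comments the original had.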

Nick
