Hi Nick,

Tika is deployed as part of my Nutch installation, and adjusting parse-tika to implement the first two suggestions is not a simple task. However, removing all types other than text/plain from tika-mimetypes might do the trick. Will Tika then fall back to that type even for files it would not have detected as text/plain in the first place?

Thanks,

On Mon, 4 Apr 2011 17:22:20 +0100 (BST), Nick Burch <[email protected]> wrote:
On Mon, 4 Apr 2011, Markus Jelsma wrote:
I've got some OCR'd books in plain text format which are incorrectly detected as application/x-elc, probably because of the junk bytes at the head and sometimes the tail. I've also seen some being detected as Shockwave files. An additional problem is that the "file" program also identifies these files as Lisp data.

One option is that if you trust the filename, you could match on just
that. For example, you could decide that
http://ia600400.us.archive.org/ can be trusted to get the content type
correct on text files, and just use theirs.
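
As a rough illustration of the trusted-host idea, here is a minimal Python
sketch (not Nutch or Tika code). The function name, the TRUSTED_HOSTS set
and the example path are assumptions made up for this sketch; only the
host name comes from the message above.

```python
from urllib.parse import urlparse

# Hosts whose declared Content-Type we choose to believe (assumption).
TRUSTED_HOSTS = {"ia600400.us.archive.org"}


def choose_type(url, declared_type, detected_type):
    """Prefer the server-declared type for trusted hosts, else the detected one."""
    host = urlparse(url).hostname
    if host in TRUSTED_HOSTS and declared_type:
        # Drop parameters such as "; charset=UTF-8" and normalise case.
        return declared_type.split(";")[0].strip().lower()
    return detected_type


# Hypothetical URL path, for illustration only.
print(choose_type("http://ia600400.us.archive.org/some/book.txt",
                  "text/plain; charset=UTF-8",
                  "application/x-elc"))
```

For any host outside the trusted set, the detected type is returned
unchanged, so magic-based detection still applies everywhere else.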

Another approach that could work in your special case is to run
detection on the start of the file and again at, say, 10% and 20% in.
If the 10% and 20% samples both come back as octet stream, then the
first detection is likely to be correct. If they both come back as
text, then there's a fair chance it's actually one of your iffy text
files.
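
The sampling idea above can be sketched roughly as follows, in Python
rather than in parse-tika. The looks_like_text() helper is a crude
stand-in for a real detector and assumes mostly ASCII text; the window
size and threshold are arbitrary assumptions, not Tika defaults.

```python
def looks_like_text(chunk, threshold=0.95):
    """Crude stand-in for a detector: true if most bytes are printable ASCII."""
    if not chunk:
        return False
    printable = sum(1 for b in chunk if b in (9, 10, 13) or 32 <= b < 127)
    return printable / len(chunk) >= threshold


def probably_text(data, offsets=(0.10, 0.20), window=512):
    """Sample a window at each fractional offset; text only if all samples agree."""
    samples = []
    for fraction in offsets:
        start = int(len(data) * fraction)
        samples.append(data[start:start + window])
    return all(looks_like_text(s) for s in samples)


# Junk bytes at the head, followed by ordinary text.
ocr_like = b"\x00\x01\x02\xff" * 8 + b"The quick brown fox. " * 200
# Every byte value repeated: clearly not text.
binary_like = bytes(range(256)) * 20

print(probably_text(ocr_like))
print(probably_text(binary_like))
```

Because the samples are taken well past the start of the file, junk
bytes in the head do not influence the result; junk in the tail is
ignored for the same reason.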

Finally, if you know none of your files will be of certain
problematic types, you could try just removing them from your mime
magic list?

Nick
