On Tue, 7 Feb 2012, Public Network Services wrote:
> Counting tags only, apparently there are 1,304 different variations of
> MIME types there (!), so I would like to map them to, say, a few custom
> top-level categories like "Office", "PDF", "Audio", "Video", or similar.
> Assuming this is not done in Tika, what would be the fastest way of parsing
> in the 1,304 "registered" MIME types and mapping them to categories?
Audio and Video should be easy: they're already covered by the MIME types
themselves, via the top-level type (audio/..., video/...).

Tika MIME types do have a hierarchy, so you can get some information from
that. For example, all the OLE2-based MS Office formats share a common
parent, as do the OOXML ones, the Apple iWork ones, etc.
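
As a rough sketch of that idea (not Tika code; it uses only the JDK): check
the top-level type first, then walk up a child-to-parent table until a type
with a known category is found. The class name MimeCategories, the "Other"
fallback and both tables are made up for illustration, and the parent names
in the table are my assumption of what the hierarchy looks like. In a real
setup the parent lookup would come from Tika's own type registry (I believe
MediaTypeRegistry has a getSupertype method for this, but check the Javadoc
for your version).

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Tika.
final class MimeCategories {
    // Hand-written stand-in for the type hierarchy (child -> parent).
    // The entries are illustrative assumptions, not read from Tika.
    static final Map<String, String> PARENT = new HashMap<>();
    // Custom categories, attached to common parents or to single types.
    static final Map<String, String> CATEGORY = new HashMap<>();

    static {
        PARENT.put("application/msword", "application/x-tika-msoffice");
        PARENT.put("application/vnd.ms-excel", "application/x-tika-msoffice");
        PARENT.put(
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
            "application/x-tika-ooxml");
        CATEGORY.put("application/x-tika-msoffice", "Office");
        CATEGORY.put("application/x-tika-ooxml", "Office");
        CATEGORY.put("application/pdf", "PDF");
    }

    static String categorize(String mimeType) {
        String type = mimeType.trim().toLowerCase(Locale.ROOT);
        // Drop parameters such as "; charset=utf-8".
        int semi = type.indexOf(';');
        if (semi >= 0) {
            type = type.substring(0, semi).trim();
        }
        // Audio and video are identified by the top-level type alone.
        if (type.startsWith("audio/")) {
            return "Audio";
        }
        if (type.startsWith("video/")) {
            return "Video";
        }
        // Walk up the hierarchy until a categorized type is found.
        // The hop limit guards against an accidental cycle in the table.
        String current = type;
        for (int hops = 0; current != null && hops < 32; hops++) {
            String category = CATEGORY.get(current);
            if (category != null) {
                return category;
            }
            current = PARENT.get(current);
        }
        return "Other";
    }
}
```

With the tables above, MimeCategories.categorize("application/msword") gives
"Office" through the common parent, and anything without a categorized
ancestor falls through to "Other".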
Nick