Thanks very much, Nick. I know about the predefined types in the MediaType class. Perhaps we could find the time to add more of them in a future revision (for instance, it would be nice to have something like MediaType.EXCEL_OLE2 predefined).
Any idea on where the actual parsing of the XML configuration file is done?

On Tue, Feb 7, 2012 at 4:04 PM, Nick Burch <[email protected]> wrote:

> On Tue, 7 Feb 2012, Public Network Services wrote:
>
>> Counting tags only, apparently there are 1,304 different variations of
>> MIME types there (!), so I would like to map them to, say, a few custom
>> top-level categories like "Office", "PDF", "Audio", "Video", or similar.
>>
>> Assuming this is not done in Tika, what would be the fastest way of
>> parsing in the 1,304 "registered" MIME types and mapping them to
>> categories?
>>
>
> Audio and Video should be easy, they're already done in the mimetypes
> themselves
>
> Tika mimetypes do have a hierarchy, so you can get some information from
> that. For example, all the OLE2 based MS Office formats have a common
> parent, as do the OOXML ones, Apple iworks etc
>
> Nick
>
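To make the hierarchy idea concrete, here is a rough sketch of mapping types to custom categories outside Tika, using only the JDK's DOM parser. It assumes the configuration file uses <mime-type type="..."> elements with optional <sub-class-of type="..."/> children. The inline XML is a small hypothetical excerpt rather than the real file, and the class name, the category table and the "Other" fallback are my own choices for illustration, not anything Tika provides.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class MimeCategorizer {

    // Hypothetical excerpt modelled on the layout of the mimetypes XML file.
    static final String SAMPLE_XML =
        "<mime-info>"
      + "<mime-type type=\"application/x-tika-msoffice\"/>"
      + "<mime-type type=\"application/vnd.ms-excel\">"
      + "<sub-class-of type=\"application/x-tika-msoffice\"/>"
      + "</mime-type>"
      + "<mime-type type=\"application/x-tika-ooxml\"/>"
      + "<mime-type type=\"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet\">"
      + "<sub-class-of type=\"application/x-tika-ooxml\"/>"
      + "</mime-type>"
      + "<mime-type type=\"application/pdf\"/>"
      + "<mime-type type=\"audio/mpeg\"/>"
      + "<mime-type type=\"video/mp4\"/>"
      + "<mime-type type=\"text/plain\"/>"
      + "</mime-info>";

    // Illustrative table: ancestor type -> custom category (an assumption, not Tika's).
    static final Map<String, String> ROOT_CATEGORIES = new HashMap<>();
    static {
        ROOT_CATEGORIES.put("application/x-tika-msoffice", "Office");
        ROOT_CATEGORIES.put("application/x-tika-ooxml", "Office");
        ROOT_CATEGORIES.put("application/pdf", "PDF");
    }

    // Reads every mime-type element and records its first declared parent (or null).
    static Map<String, String> readParents(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        Map<String, String> parents = new HashMap<>();
        NodeList types = doc.getElementsByTagName("mime-type");
        for (int i = 0; i < types.getLength(); i++) {
            Element type = (Element) types.item(i);
            NodeList supers = type.getElementsByTagName("sub-class-of");
            String parent = supers.getLength() > 0
                ? ((Element) supers.item(0)).getAttribute("type")
                : null;
            parents.put(type.getAttribute("type"), parent);
        }
        return parents;
    }

    // Walks up the parent chain until a categorised ancestor is found.
    static String categorize(String type, Map<String, String> parents) {
        String current = type;
        // The depth bound guards against accidental cycles in the input.
        for (int depth = 0; current != null && depth < 32; depth++) {
            String category = ROOT_CATEGORIES.get(current);
            if (category != null) {
                return category;
            }
            current = parents.get(current);
        }
        // Audio and video are already distinguished by their top-level type.
        if (type.startsWith("audio/")) return "Audio";
        if (type.startsWith("video/")) return "Video";
        return "Other";
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> parents = readParents(SAMPLE_XML);
        for (String type : parents.keySet()) {
            System.out.println(type + " -> " + categorize(type, parents));
        }
    }
}
```

Walking the sub-class-of chain means only the handful of common parents need an entry in the category table, rather than all 1,304 types. A type that declares several parents would need more than the first-parent shortcut used here.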
