Thanks very much, Nick.

I know about the predefined types in the MediaType class. Perhaps we could
find the time to add more of them in a future revision (for instance, it
would be nice to have something like MediaType.EXCEL_OLE2 predefined).

Any idea where the actual parsing of the XML configuration file is done?
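
In the meantime, here is a rough sketch of the kind of mapping I have in mind,
done outside Tika with the plain JDK XML parser. It assumes the file uses
<mime-type type="..."> elements with optional <sub-class-of type="..."/>
children. The class name, the category table, and the inline sample XML are
all made up for illustration and are not part of Tika's API.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not a Tika class.
class MimeCategories {

    // Assumed root types mapped to custom categories; extend as needed.
    static final Map<String, String> ROOTS = new HashMap<>();
    static {
        ROOTS.put("application/x-tika-msoffice", "Office");
        ROOTS.put("application/x-tika-ooxml", "Office");
        ROOTS.put("application/pdf", "PDF");
    }

    // Tiny stand-in for the real configuration file (illustrative only).
    static final String SAMPLE =
          "<mime-info>"
        + "<mime-type type=\"application/x-tika-msoffice\"/>"
        + "<mime-type type=\"application/vnd.ms-excel\">"
        + "<sub-class-of type=\"application/x-tika-msoffice\"/>"
        + "</mime-type>"
        + "<mime-type type=\"audio/mpeg\"/>"
        + "<mime-type type=\"text/plain\"/>"
        + "</mime-info>";

    // Reads every <mime-type> and records its first declared parent (or null).
    static Map<String, String> readParents(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> parents = new HashMap<>();
            NodeList types = doc.getElementsByTagName("mime-type");
            for (int i = 0; i < types.getLength(); i++) {
                Element type = (Element) types.item(i);
                NodeList subs = type.getElementsByTagName("sub-class-of");
                String parent = subs.getLength() > 0
                    ? ((Element) subs.item(0)).getAttribute("type")
                    : null;
                parents.put(type.getAttribute("type"), parent);
            }
            return parents;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse MIME type XML", e);
        }
    }

    // Maps a MIME type string to one of the custom top-level categories.
    static String categorize(String type, Map<String, String> parents) {
        // Audio and video can be recognised from the primary type alone.
        if (type.startsWith("audio/")) return "Audio";
        if (type.startsWith("video/")) return "Video";
        // Otherwise walk up the parent chain until a known root is found.
        // The depth limit guards against accidental cycles in the data.
        String current = type;
        for (int depth = 0; current != null && depth < 16; depth++) {
            String category = ROOTS.get(current);
            if (category != null) return category;
            current = parents.get(current);
        }
        return "Other";
    }

    public static void main(String[] args) {
        Map<String, String> parents = readParents(SAMPLE);
        for (String type : parents.keySet()) {
            System.out.println(type + " -> " + categorize(type, parents));
        }
    }
}
```

Anything that does not reach a known root falls into "Other", so the category
table can be grown incrementally as unmapped types turn up.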


On Tue, Feb 7, 2012 at 4:04 PM, Nick Burch <[email protected]> wrote:

> On Tue, 7 Feb 2012, Public Network Services wrote:
>
>> Counting tags only, apparently there are 1,304 different variations of
>> MIME types there (!), so I would like to map them to, say, a few custom
>> top-level categories like "Office", "PDF", "Audio", "Video", or similar.
>>
>> Assuming this is not done in Tika, what would be the fastest way of parsing
>> in the 1,304 "registered" MIME types and mapping them to categories?
>>
>
> Audio and Video should be easy; they're already done in the mimetypes
> themselves.
>
> Tika mimetypes do have a hierarchy, so you can get some information from
> that. For example, all the OLE2-based MS Office formats have a common
> parent, as do the OOXML ones, Apple iWork, etc.
>
> Nick
>
