Are the mime-type patterns ANDed or ORed? If a type has both a glob and a
magic pattern, must both match for the type to be assigned, or is either
one enough? If either is enough, which takes precedence? If the glob is
checked first and does not match (e.g. a mislabeled file), is the magic
still checked?

Based on information at

http://library.gnome.org/admin/system-admin-guide/stable/mimetypes-source-xml.html.en
and
http://standards.freedesktop.org/shared-mime-info-spec/shared-mime-info-spec-0.18.html#id2653327

it looks like the mimetypes.xml file is intended to be used as follows:
try the globs first. If zero or more than one mime-type matches, then
try a magic match (if any magic is provided). If neither glob nor magic
yields a match (or none is provided), default to text/plain or
application/octet-stream.
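
The order described above can be sketched roughly as follows. This is only
an illustration of that reading, not TIKA's code and not the spec's
reference algorithm; the RULES table, its byte signatures, and the function
names are all hypothetical.

```python
from fnmatch import fnmatch

# Hypothetical rule table: mime type -> (glob patterns, magic prefixes).
# Entries are illustrative, not copied from tika-mimetypes.xml.
RULES = {
    "application/pdf": (["*.pdf"], [b"%PDF-"]),
    "image/png": (["*.png"], [b"\x89PNG\r\n\x1a\n"]),
    "text/html": (["*.html", "*.htm"], []),  # glob only, no magic
}


def glob_matches(name):
    """Return every type with at least one glob matching the file name."""
    return [mime for mime, (globs, _) in RULES.items()
            if any(fnmatch(name.lower(), g) for g in globs)]


def magic_matches(data):
    """Return every type with a magic prefix found at the start of data."""
    return [mime for mime, (_, magics) in RULES.items()
            if any(data.startswith(m) for m in magics)]


def looks_like_text(data):
    # Crude assumption: no NUL byte in the first 512 bytes means text.
    return b"\x00" not in data[:512]


def detect_glob_first(name, data):
    by_glob = glob_matches(name)
    if len(by_glob) == 1:
        return by_glob[0]  # a single glob match wins; magic is never consulted
    by_magic = magic_matches(data)  # zero or several glob matches
    if by_magic:
        return by_magic[0]
    return "text/plain" if looks_like_text(data) else "application/octet-stream"
```

Under this reading, a PNG file saved as report.pdf comes back as
application/pdf, because the single glob match short-circuits the magic
check.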

What about the case where there is a single glob match, but the extension
was applied sloppily and magic would have typed the file correctly? Can
TIKA save itself from making an error in this case? I think I would prefer
to see magic first, then glob. I already have the files in memory, so
seeking is not a problem.
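
A minimal sketch of the magic-first order, assuming the whole file is
already in memory as bytes. The MAGIC and GLOBS tables and the
detect_magic_first name are hypothetical and only illustrate the preference
stated above; this is not how TIKA is known to behave.

```python
from fnmatch import fnmatch

# Hypothetical signatures and globs, for illustration only.
MAGIC = {
    b"%PDF-": "application/pdf",
    b"\x89PNG\r\n\x1a\n": "image/png",
}
GLOBS = {
    "*.pdf": "application/pdf",
    "*.png": "image/png",
    "*.txt": "text/plain",
}


def detect_magic_first(name, data):
    # 1. Trust the content: check the leading bytes against known signatures.
    for prefix, mime in MAGIC.items():
        if data.startswith(prefix):
            return mime
    # 2. Only fall back to the file name when no signature matched.
    for pattern, mime in GLOBS.items():
        if fnmatch(name.lower(), pattern):
            return mime
    # 3. Nothing matched at all.
    return "application/octet-stream"
```

With this order a PDF that was saved as scan.png is still typed as
application/pdf, while a file with no known signature falls back to
whatever its name suggests.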

Any additional insights are welcome.

Thanks

Doug


On Mon, Jun 18, 2012 at 5:11 PM, Nick Burch <[email protected]> wrote:

> On Mon, 18 Jun 2012, Doug wrote:
>
>> I'm planning to use TIKA as part of a process for cataloging data on a
>> share drive. Based on the website and tika-mimetypes.xml, the type
>> detection looks pretty comprehensive. However, while browsing
>> tika-mimetypes.xml, I noticed that about half of the mime-types listed have
>> no associated glob, root-XML, or magic elements. Without this match
>> criteria, can TIKA ever actually detect a file of one of these types?
>>
>
> For a type to be detected, Tika needs something to go on. That could be a
> glob, an XML root element, some magic, or even a combination of all of them.
>
>
>> I browsed the detector source. It looks like it tries to match against
>> magic, then XML, then names/globs/patterns. If a mime-type doesn't have
>> any of these, can TIKA do anything with it? If so, why is it listed in the
>> tika-mimetypes.xml file?
>>
>
> The tika-mimetypes.xml file is used for both detection and information.
> With those entries, we can tell you something about the mimetype, even if
> we can't always detect it.
>
> Nick
>
