Are the mime-type patterns ANDed or ORed? If a type has both a glob and a magic pattern, must both match for the type to be assigned, or is either one enough, and if so, which takes precedence? If a glob pattern is listed first and it does not match (e.g. a mislabeled file), will the magic still be checked?
Based on information at
http://library.gnome.org/admin/system-admin-guide/stable/mimetypes-source-xml.html.en
and
http://standards.freedesktop.org/shared-mime-info-spec/shared-mime-info-spec-0.18.html#id2653327
it looks like the mimetypes.xml file is intended to be used as follows:

Glob first. If there are zero or more than one glob matches for the mime-type, then try a magic match (if one is provided). If neither glob nor magic matches (or none is provided), default to text/plain or application/octet-stream.

* What about the case where there is a single glob match, but the extension was sloppily applied and magic would have correctly typed the file? Can TIKA save itself from making an error in this case?
* I think I would prefer to see it as magic first, then glob. I've already got the files in memory anyway, so seeking is not a problem.

Any additional insights are welcome.

Thanks,
Doug

On Mon, Jun 18, 2012 at 5:11 PM, Nick Burch <[email protected]> wrote:

> On Mon, 18 Jun 2012, Doug wrote:
>
>> I'm planning to use TIKA as part of a process for cataloging data on a
>> share drive. Based on the website and tika-mimetypes.xml, the type
>> detection looks pretty comprehensive. However, while browsing
>> tika-mimetypes.xml, I noticed that about half of the mime-types listed
>> have no associated glob, root-XML, or magic elements. Without this match
>> criteria, can TIKA ever actually detect a file of one of these types?
>
> To be detected, Tika will need something to go on. That could be a glob, a
> XML root element, some magic, or even a combination of all of them.
>
>> I browsed the detector source. It looks like it tries to match against
>> magic, then XML, then names/globs/patterns. If a mime-type doesn't have
>> any of these, can TIKA do anything with it? If so, why is it listed in
>> the tika-mimetypes.xml file?
>
> The tika-mimetypes.xml file is used for both detection and information.
> With those entries, we can tell you something about the mimetype, even if
> we can't always detect it.
>
> Nick
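
To make the two orderings I am asking about concrete, here is a minimal Python sketch. It is not Tika's code and not taken from the spec text: the rule table, the function names (glob_first, magic_first, fallback) and the text/binary heuristic are all made up for illustration. It only shows how "glob first, magic on zero or multiple matches" and "magic first, then glob" differ on a mislabeled file.

```python
import fnmatch

# Hypothetical rule table: (mime type, glob patterns, [(offset, magic bytes)]).
RULES = [
    ("application/pdf", ["*.pdf"], [(0, b"%PDF-")]),
    ("image/png", ["*.png"], [(0, b"\x89PNG\r\n\x1a\n")]),
    ("text/x-log", ["*.log"], []),  # glob only, no magic
]


def glob_matches(name):
    """Return every type whose glob matches the (lower-cased) file name."""
    lowered = name.lower()
    return [mime for mime, globs, _ in RULES
            if any(fnmatch.fnmatchcase(lowered, g) for g in globs)]


def magic_matches(data):
    """Return every type with a magic signature found at its offset."""
    return [mime for mime, _, magics in RULES
            if any(data[off:off + len(sig)] == sig for off, sig in magics)]


def fallback(data):
    """Assumed heuristic: control bytes in the first 32 bytes mean binary."""
    text_controls = {9, 10, 13}  # tab, LF, CR
    binary = any(b < 32 and b not in text_controls for b in data[:32])
    return "application/octet-stream" if binary else "text/plain"


def glob_first(name, data):
    """Trust a single glob match; consult magic only on 0 or >1 matches."""
    by_glob = glob_matches(name)
    if len(by_glob) == 1:
        return by_glob[0]
    by_magic = magic_matches(data)
    for mime in by_magic:
        if mime in by_glob:  # magic breaks a tie between several globs
            return mime
    if by_magic:
        return by_magic[0]
    return by_glob[0] if by_glob else fallback(data)


def magic_first(name, data):
    """Trust the content; use the file name only when no magic matches."""
    by_magic = magic_matches(data)
    if by_magic:
        return by_magic[0]
    by_glob = glob_matches(name)
    return by_glob[0] if by_glob else fallback(data)
```

With a PDF saved as report.png, glob_first("report.png", b"%PDF-1.4") trusts the single glob match and returns image/png, while magic_first returns application/pdf. For a type that has only a glob (the text/x-log entry here), both orderings give the same answer.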
