Nick, thank you for everything.
I humbly accept all of your comments and will check the media type, then recurse through its supertypes. I will also check the aliases of my expected media types to improve media-type recognition.

I proposed to the team that we drop support for media types that Tika does not recognize; if Tika later adds them to its registry, we will support them automatically.

I still have one question for you, which might point to a missing media type or alias in Tika. If that is the case, I will open an issue in Tika's bug tracking system.

Wikipedia describes the GZIP format here:
http://en.wikipedia.org/wiki/Gzip

The media type according to Wikipedia is application/gzip, while in the Tika database it is "*application/x-gzip*", and "*application/gzip*" is left out entirely (not even as an alias).

Is this a bug, or am I missing something?

On Thu, Apr 24, 2014 at 1:55 PM, Nick Burch <[email protected]> wrote:

> On Thu, 24 Apr 2014, אברהם חיון wrote:
>
>>> These two are aliases. You might need to check you're using the canonical
>>> form
>>
>> *Can you please elaborate? What is the difference between the alias and
>> the canonical form ?*
>
> From the Tika mimetypes file:
>
>   <mime-type type="application/xml">
>     <acronym>XML</acronym>
>     <_comment>Extensible Markup Language</_comment>
>     <tika:link>http://en.wikipedia.org/wiki/Xml</tika:link>
>     <tika:uti>public.xml</tika:uti>
>     <alias type="text/xml"/>
>
> So, the official / canonical mimetype is application/xml, while text/xml
> is an alias for it.
>
> MediaTypeRegistry - http://tika.apache.org/1.5/api/org/apache/tika/mime/MediaTypeRegistry.html -
> can give you the aliases for a given canonical type. You can use the
> normalize call to turn the alias into the canonical form if needed
>
>>> Tika doesn't know about this, is it a common alias?
>>
>> *Not used a lot, but several places list it as an XML type, like here:*
>> *http://filext.com/file-extension/XML*
>> *or*
>> *http://help.dottoro.com/lapuadlp.php*
>
> If they're commonly used aliases, please open a jira and suggest them
>
>> *Where should I look to see the right and acceptable mediaType / aliases of
>> every format ?*
>
> https://svn.apache.org/repos/asf/tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
>
> Nick
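
To make the plan at the top of this message concrete, here is a rough sketch of the check I have in mind: normalize an alias to its canonical type, then walk up the supertypes until an expected type is found. It deliberately does not call Tika; the class `MediaTypeCheck`, its two maps, and their contents are hypothetical stand-ins for the registry. Only the text/xml alias comes from the mimetypes entry quoted above, and the supertype entries are illustrative assumptions. In real code I would expect `MediaTypeRegistry` to supply the alias normalization and the supertype lookup.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class MediaTypeCheck {
    // Hypothetical toy registry; Tika loads the real data from tika-mimetypes.xml.
    private static final Map<String, String> ALIASES = new HashMap<>();
    private static final Map<String, String> SUPERTYPES = new HashMap<>();

    static {
        // Alias taken from the application/xml entry quoted in this thread.
        ALIASES.put("text/xml", "application/xml");
        // Illustrative supertype links, assumed for this sketch only.
        SUPERTYPES.put("application/xhtml+xml", "application/xml");
        SUPERTYPES.put("application/xml", "application/octet-stream");
    }

    // Turn an alias into its canonical form; unknown or canonical types pass through.
    static String normalize(String type) {
        return ALIASES.getOrDefault(type, type);
    }

    // True if the detected type, or any of its supertypes, is an expected type.
    static boolean isSupported(String detected, Set<String> expected) {
        // Normalize the expected types too, so callers may pass aliases.
        Set<String> canonicalExpected = new HashSet<>();
        for (String e : expected) {
            canonicalExpected.add(normalize(e));
        }
        String current = normalize(detected);
        while (current != null) {
            if (canonicalExpected.contains(current)) {
                return true;
            }
            // Move one step up the hierarchy; null means no known supertype.
            current = SUPERTYPES.get(current);
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> expected = new HashSet<>();
        expected.add("application/xml");
        System.out.println(isSupported("text/xml", expected));              // true (alias)
        System.out.println(isSupported("application/xhtml+xml", expected)); // true (supertype)
        System.out.println(isSupported("application/gzip", expected));      // false (unknown here)
    }
}
```

With this shape, a type that the registry does not know, such as application/gzip in the toy data, is rejected until an alias or supertype entry is added for it, which matches the "support it automatically once Tika knows it" idea.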
