Nick, thank you for everything.

I accept all of your comments: I will check the media type first and then
recurse through its supertypes.
I will also check the aliases of my expected media types to improve
media-type recognition.
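
To make the plan above concrete, here is a minimal sketch of the lookup I have in mind. It does not use Tika's API: `MediaTypeCheck` and its methods are hypothetical stand-ins, and the alias and supertype entries are filled in by hand purely for illustration (the xhtml entry is an assumption for the example). In the real code the alias and supertype data would come from Tika's MediaTypeRegistry.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a media type registry; this is NOT Tika's API.
final class MediaTypeCheck {
    // alias -> canonical type
    private final Map<String, String> aliases = new HashMap<>();
    // type -> direct supertype
    private final Map<String, String> supertypes = new HashMap<>();

    void addAlias(String alias, String canonical) {
        aliases.put(alias, canonical);
    }

    void addSupertype(String type, String supertype) {
        supertypes.put(type, supertype);
    }

    // Turn an alias into its canonical form; unknown types are returned unchanged.
    String normalize(String type) {
        return aliases.getOrDefault(type, type);
    }

    // True if the detected type, or any of its supertypes, is one of the
    // expected types. The expected set must hold canonical names.
    boolean isSupported(String detected, Set<String> expected) {
        String current = normalize(detected);
        // The iteration bound guards against accidental cycles in the supertype map.
        for (int i = 0; current != null && i <= supertypes.size(); i++) {
            if (expected.contains(current)) {
                return true;
            }
            String parent = supertypes.get(current);
            current = (parent == null) ? null : normalize(parent);
        }
        return false;
    }

    public static void main(String[] args) {
        MediaTypeCheck registry = new MediaTypeCheck();
        registry.addAlias("text/xml", "application/xml");                   // from the mimetypes excerpt
        registry.addSupertype("application/xhtml+xml", "application/xml"); // illustrative entry

        Set<String> expected = Collections.singleton("application/xml");
        System.out.println(registry.isSupported("text/xml", expected));
        System.out.println(registry.isSupported("application/xhtml+xml", expected));
        System.out.println(registry.isSupported("application/gzip", expected));
    }
}
```

With these hand-made entries, a type that is neither an alias nor a subtype of an expected type (such as application/gzip here) is simply rejected, which matches the proposal to drop unrecognized types.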


I proposed to the team that we drop support for media types that Tika does
not recognize; if Tika later adds them to its registry, we will support
them automatically.



I still have one question for you. It may point to a missing media type or
alias in Tika, and if that is the case I will open an issue in Tika's bug
tracker.


Wikipedia describes the GZIP format here:
http://en.wikipedia.org/wiki/Gzip

According to Wikipedia the media type is application/gzip, while in the Tika
registry it is "application/x-gzip", and "application/gzip" is left out
entirely (it is not even an alias).

Is this a bug, or am I missing something?







On Thu, Apr 24, 2014 at 1:55 PM, Nick Burch <[email protected]> wrote:

> On Thu, 24 Apr 2014, אברהם חיון wrote:
>
>> These two are aliases. You might need to check you're using the canonical
>>> form
>>>
>>>  *Can you please elaborate?   What is the difference between the alias
>> and
>> the canonical form ?*
>>
>
> From the Tika mimetypes file:
>
>   <mime-type type="application/xml">
>     <acronym>XML</acronym>
>     <_comment>Extensible Markup Language</_comment>
>     <tika:link>http://en.wikipedia.org/wiki/Xml</tika:link>
>     <tika:uti>public.xml</tika:uti>
>     <alias type="text/xml"/>
>
> So, the official / canonical mimetype is application/xml, while text/xml
> is an alias for it.
>
> MediaTypeRegistry -
> http://tika.apache.org/1.5/api/org/apache/tika/mime/MediaTypeRegistry.html -
> can give you the aliases for a given canonical type. You can use the
> normalize call to turn the alias into the canonical form if needed
>
>
>  Tika doesn't know about this, is it a common alias?
>>>
>>
>> *Not used a lot, but several places list it as an XML type, like here:*
>> *http://filext.com/file-extension/XML
>> <http://filext.com/file-extension/XML>*
>> *or*
>> *http://help.dottoro.com/lapuadlp.php
>> <http://help.dottoro.com/lapuadlp.php>*
>>
>
> If they're commonly used aliases, please open a jira and suggest them
>
>  *Where should I look to see the right and acceptable mediaType / aliases
>> of
>> every format ?*
>>
>
> https://svn.apache.org/repos/asf/tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
>
> Nick
