[ 
https://issues.apache.org/jira/browse/TIKA-285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12760023#action_12760023
 ] 

Ken Krugler commented on TIKA-285:
----------------------------------

The "file" command line utility also has a pretty good set of magic byte 
settings - we'd looked at it when working on Krugle. From what I recall, it 
also has a slightly more sophisticated method for processing magic bytes than 
what Nutch (and I guess now Tika) has.

One of the issues we'd run into was the need to be able to use a regex against 
the header bytes to determine true file type, versus fixed offsets/values.
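
To illustrate the difference, here is a minimal sketch in Python. It is not 
how Tika, Nutch, or "file" actually implement detection; the rule tables, the 
detect() function, and the specific patterns are all hypothetical and only 
show why a regex over the header can catch cases that a fixed offset/value 
check cannot (e.g. an interpreter line whose path varies).

```python
import re

# Hypothetical fixed-offset rules: (offset, expected bytes, media type).
FIXED_RULES = [
    (0, b"%PDF-", "application/pdf"),
    (0, b"\x89PNG\r\n\x1a\n", "image/png"),
]

# Hypothetical regex rules, matched against the raw header bytes.
# These cover headers where the telling bytes do not sit at a fixed offset.
REGEX_RULES = [
    # XML declaration with arbitrary attributes, followed by an <svg> root.
    (re.compile(rb"\A\s*<\?xml[^>]*\?>\s*<svg\b"), "image/svg+xml"),
    # Shebang line naming a Python interpreter, with or without "env".
    (re.compile(rb"\A#![ \t]*\S*/(?:env[ \t]+)?python[0-9.]*\b"), "text/x-python"),
]


def detect(header: bytes, default: str = "application/octet-stream") -> str:
    """Return a media type guess for the given header bytes."""
    # Try the cheap fixed offset/value comparisons first.
    for offset, magic, mime in FIXED_RULES:
        if header[offset:offset + len(magic)] == magic:
            return mime
    # Fall back to regex rules for headers with variable layout.
    for pattern, mime in REGEX_RULES:
        if pattern.search(header):
            return mime
    return default
```

For example, detect(b"#!/usr/bin/env python3\n") and 
detect(b"#!/usr/local/bin/python\n") both match the same regex rule, whereas 
a fixed offset/value rule would need one entry per interpreter path.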


> Update media type registry to the latest httpd mime type database
> -----------------------------------------------------------------
>
>                 Key: TIKA-285
>                 URL: https://issues.apache.org/jira/browse/TIKA-285
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime
>            Reporter: Jukka Zitting
>
> The MIME type database included in the Apache HTTP Server is one of the more 
> complete and accurate media type and file extension resources out there.
> Their magic byte settings don't seem to be as complete as the ones in Tika, 
> but it would be good to check also those settings for extra information.
> ... and we should contribute any of the recent Tika settings back to httpd 
> where they don't already know of those details.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
