Hi,

On Thu, May 21, 2009 at 7:48 PM, Robert Burrell Donkin
<robertburrelldon...@gmail.com> wrote:
> A. from the basic user perspective, the quick start way to mime type is to
>
> 1. Use MimeTypesFactory#createMimeTypes() to create a MimeTypes with
> the default tika configuration
> 2. if you want just name based heuristics call getMimeType passing a
> file, url or name
> 3. if you want full typing heuristics including magic call getMimeType
> passing an input stream

Yeah. That's the original mechanism we've had in place since Tika 0.1.
It works, but I'm not entirely happy with the current MimeTypes
mechanism (see TIKA-87 and TIKA-89). Most notably, the MimeTypes class
is hard to configure or extend. I'm hoping to refactor things before
we reach Tika 1.0.

The current best practice for type detection is to use the Detector
interface, with the MimeTypes class as the Detector implementation.
The MimeTypes.detect() method currently contains the best detection
heuristics we have. That's also what the AutoDetectParser uses for
automatic type detection.

> B. from an advanced user perspective, the heuristics can be customised by
>
> 1.passing a different configuration file to
> MimeTypesFactory#createMimeTypes(XYZ)
> 2 & 3 as above

Yep. The type configuration included in Tika is already quite good,
but there are still lots of details missing. Contributions are
welcome...

For per-application customizations the current best practice is to
take a copy of the existing type configuration file from Tika and
modify it. Note that you'll need to update this copy on each Tika
upgrade to pick up the latest improvements. TIKA-87 should solve this
problem.

> C. developers of new detectors should take a look at the detector
> interface and then customise as above

We don't yet have a configuration mechanism for Detector
implementations, but I would still recommend implementing any custom
detection algorithm against the Detector interface. The
CompositeDetector class makes it easy to combine custom detectors with
the default functionality in Tika:

    Detector composite = new CompositeDetector(
        Arrays.asList(new MyCustomDetector(), MimeTypesFactory.create(...)));

The composite detector will use each of the given component detectors
in sequence and will return the most specific detected media type.

BR,

Jukka Zitting
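
PS. In case the "each detector in sequence, most specific type wins" idea is easier to follow in code, here is a minimal self-contained sketch. It deliberately does not use the Tika classes: SimpleDetector, CompositeSketch, the two toy detectors and the isSpecialization() rule are all made up for illustration, and the real CompositeDetector may well differ in detail.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class CompositeSketch {

    /** Hypothetical stand-in for a detector; NOT the Tika Detector interface. */
    public interface SimpleDetector {
        /** Returns a media type, or UNKNOWN when this detector has no opinion. */
        String detect(byte[] prefix, String name);
    }

    public static final String UNKNOWN = "application/octet-stream";

    /** Name-based heuristic: looks only at the file extension. */
    public static final SimpleDetector BY_NAME = (prefix, name) ->
            name != null && name.endsWith(".txt") ? "text/plain" : UNKNOWN;

    /** Magic-based heuristic: looks only at the leading bytes. */
    public static final SimpleDetector BY_MAGIC = (prefix, name) -> {
        if (startsWith(prefix, "%PDF-")) return "application/pdf";
        if (startsWith(prefix, "<html")) return "text/html";
        return UNKNOWN;
    };

    static boolean startsWith(byte[] data, String magic) {
        byte[] m = magic.getBytes(StandardCharsets.US_ASCII);
        if (data == null || data.length < m.length) return false;
        for (int i = 0; i < m.length; i++) {
            if (data[i] != m[i]) return false;
        }
        return true;
    }

    /**
     * Toy "more specific" rule (an assumption for this sketch): any type
     * beats UNKNOWN, and any other text/* type beats text/plain.
     */
    static boolean isSpecialization(String candidate, String current) {
        if (candidate.equals(current)) return false;
        if (current.equals(UNKNOWN)) return true;
        return current.equals("text/plain") && candidate.startsWith("text/");
    }

    /** Asks each detector in order and keeps the most specific answer. */
    public static String detect(List<SimpleDetector> detectors,
                                byte[] prefix, String name) {
        String type = UNKNOWN;
        for (SimpleDetector d : detectors) {
            String candidate = d.detect(prefix, name);
            if (isSpecialization(candidate, type)) {
                type = candidate;
            }
        }
        return type;
    }

    public static void main(String[] args) {
        List<SimpleDetector> detectors = Arrays.asList(BY_NAME, BY_MAGIC);
        byte[] html = "<html><body>".getBytes(StandardCharsets.US_ASCII);
        // The name says text/plain, the magic bytes refine that to text/html.
        System.out.println(detect(detectors, html, "notes.txt"));
    }
}
```

With the real classes, the same role is played by the Detector implementations handed to the CompositeDetector constructor, as in the one-liner above.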