On Fri, May 22, 2009 at 10:45 PM, Jukka Zitting <jukka.zitt...@gmail.com> wrote:
> Hi,
>
> On Thu, May 21, 2009 at 7:48 PM, Robert Burrell Donkin
> <robertburrelldon...@gmail.com> wrote:
>> A. from the basic user perspective, the quick start way to mime type is to
>>
>> 1. Use MimeTypesFactory#createMimeTypes() to create a MimeTypes with
>> the default tika configuration
>> 2. if you want just name based heuristics call getMimeType passing a
>> file, url or name
>> 3. if you want full typing heuristics including magic call getMimeType
>> passing an input stream
>
> Yeah. That's the original mechanism we've had in place since Tika 0.1.
> It works, but I'm not entirely happy with the current MimeTypes
> mechanism (see TIKA-87 and TIKA-89). Most notably the MimeTypes class
> is hard to configure or extend. I'm hoping to refactor things before
> we reach Tika 1.0.
>
> The current best practice for type detection would be to use the
> Detector interface and the MimeTypes class as a Detector
> implementation. The MimeTypes.detect() method currently contains the
> best detection heuristics we have. That's also what the
> AutoDetectParser is using for automatic type detection.
>
>> B. from an advanced user perspective, the heuristics can be customised by
>>
>> 1.passing a different configuration file to
>> MimeTypesFactory#createMimeTypes(XYZ)
>> 2 & 3 as above
>
> Yep. The type configuration included in Tika is already quite good,
> but there are still lots of details missing. Contributions are
> welcome...
>
> For per-application customizations the current best practice is to
> take a copy of the existing type configuration file from Tika and
> modify it. Note that you'll need to update this copy per each Tika
> upgrade to get the latest improvements. TIKA-87 should solve this
> problem.
>
>> C. developers of new detectors should take a look at the detector
>> interface and then customise as above
>
> We don't yet have a configuration mechanism for Detector
> implementations, but I would still recommend any custom detection
> algorithms to be implemented using the Detector interface. The
> CompositeDetector class makes it easy to combine custom detectors with
> the default functionality in Tika:
>
>     Detector composite = new CompositeDetector(
>         Arrays.asList(new MyCustomDetector(), MimeTypesFactory.create(...)));
>
> The composite detector will use each of the given component detectors
> in sequence and will return the most specific detected media type.

ok - i'll make a start at writing up some documentation

should i add it to the bottom of
http://lucene.apache.org/tika/documentation.html or would a separate
document be better?

- robert
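
For the write-up, the "each detector in sequence, most specific type wins" behaviour quoted above could be illustrated with something like the sketch below. It is deliberately self-contained and does not use the Tika classes: SimpleDetector, SimpleCompositeDetector, the "MYF1" magic bytes, the application/x-myformat type and the isMoreSpecific rule are all hypothetical stand-ins, so treat it as a rough illustration of the idea rather than how CompositeDetector is actually implemented.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Illustration only: every type here is a stand-in, NOT Tika's real Detector API.
class CompositeDetectionSketch {

    // Generic fallback type reported when a detector recognises nothing.
    static final String OCTET_STREAM = "application/octet-stream";

    /** Hypothetical, simplified detector: sees the leading bytes and a resource name. */
    interface SimpleDetector {
        String detect(byte[] prefix, String name);
    }

    /**
     * Assumed specificity rule (not Tika's): anything beats the generic
     * fallback, and other text types beat plain text.
     */
    static boolean isMoreSpecific(String candidate, String current) {
        if (candidate.equals(current)) {
            return false;
        }
        if (current.equals(OCTET_STREAM)) {
            return true;
        }
        return current.equals("text/plain") && candidate.startsWith("text/");
    }

    /** Asks each component detector in sequence and keeps the most specific answer. */
    static final class SimpleCompositeDetector implements SimpleDetector {
        private final List<SimpleDetector> detectors;

        SimpleCompositeDetector(List<SimpleDetector> detectors) {
            this.detectors = detectors;
        }

        @Override
        public String detect(byte[] prefix, String name) {
            String type = OCTET_STREAM;
            for (SimpleDetector detector : detectors) {
                String detected = detector.detect(prefix, name);
                if (isMoreSpecific(detected, type)) {
                    type = detected;
                }
            }
            return type;
        }
    }

    /** True if data begins with the ASCII bytes of magic. */
    static boolean startsWith(byte[] data, String magic) {
        byte[] m = magic.getBytes(StandardCharsets.US_ASCII);
        if (data.length < m.length) {
            return false;
        }
        for (int i = 0; i < m.length; i++) {
            if (data[i] != m[i]) {
                return false;
            }
        }
        return true;
    }

    /** Hypothetical custom detector keyed on made-up "MYF1" magic bytes. */
    static SimpleDetector customDetector() {
        return (prefix, name) ->
                startsWith(prefix, "MYF1") ? "application/x-myformat" : OCTET_STREAM;
    }

    /** Stand-in for the default detection: name-based heuristics only. */
    static SimpleDetector nameDetector() {
        return (prefix, name) -> {
            if (name.endsWith(".html")) {
                return "text/html";
            }
            if (name.endsWith(".txt")) {
                return "text/plain";
            }
            return OCTET_STREAM;
        };
    }

    /** Custom detector first, default detection second, as in the quoted snippet. */
    static SimpleDetector defaultComposite() {
        return new SimpleCompositeDetector(
                Arrays.asList(customDetector(), nameDetector()));
    }

    public static void main(String[] args) {
        SimpleDetector composite = defaultComposite();
        byte[] custom = "MYF1 payload".getBytes(StandardCharsets.US_ASCII);
        byte[] text = "<html></html>".getBytes(StandardCharsets.US_ASCII);
        System.out.println(composite.detect(custom, "notes.txt"));
        System.out.println(composite.detect(text, "page.html"));
        System.out.println(composite.detect(text, "blob.bin"));
    }
}
```

With real Tika the list passed to CompositeDetector would hold Detector implementations, as in the quoted snippet; the point of the sketch is only that a custom detector and the default one can sit side by side, with the generic fallback type losing to any more specific answer.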