On Fri, May 22, 2009 at 10:45 PM, Jukka Zitting <jukka.zitt...@gmail.com> wrote:
> Hi,
>
> On Thu, May 21, 2009 at 7:48 PM, Robert Burrell Donkin
> <robertburrelldon...@gmail.com> wrote:
>> A. from the basic user perspective, the quick start way to detect a mime type is to
>>
>> 1. Use MimeTypesFactory#createMimeTypes() to create a MimeTypes with
>> the default tika configuration
>> 2. if you want just name based heuristics call getMimeType passing a
>> file, url or name
>> 3. if you want full typing heuristics including magic call getMimeType
>> passing an input stream
>
> Yeah. That's the original mechanism we've had in place since Tika 0.1.
> It works, but I'm not entirely happy with the current MimeTypes
> mechanism (see TIKA-87 and TIKA-89). Most notably the MimeTypes class
> is hard to configure or extend. I'm hoping to refactor things before
> we reach Tika 1.0.
>
> The current best practice for type detection would be to use the
> Detector interface and the MimeTypes class as a Detector
> implementation. The MimeTypes.detect() method currently contains the
> best detection heuristics we have. That's also what the
> AutoDetectParser is using for automatic type detection.
>
>> B. from an advanced user perspective, the heuristics can be customised by
>>
>> 1. passing a different configuration file to
>> MimeTypesFactory#createMimeTypes(XYZ)
>> 2 & 3 as above
>
> Yep. The type configuration included in Tika is already quite good,
> but there are still lots of details missing. Contributions are
> welcome...
>
> For per-application customizations the current best practice is to
> take a copy of the existing type configuration file from Tika and
> modify it. Note that you'll need to update this copy on each Tika
> upgrade to get the latest improvements. TIKA-87 should solve this
> problem.
>
>> C. developers of new detectors should take a look at the detector
>> interface and then customise as above
>
> We don't yet have a configuration mechanism for Detector
> implementations, but I would still recommend that any custom detection
> algorithms be implemented using the Detector interface. The
> CompositeDetector class makes it easy to combine custom detectors with
> the default functionality in Tika:
>
>    Detector composite = new CompositeDetector(
>        Arrays.asList(new MyCustomDetector(), MimeTypesFactory.create(...)));
>
> The composite detector will use each of the given component detectors
> in sequence and will return the most specific detected media type.
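
To make the "each detector in sequence, most specific type wins" behaviour
described above concrete, here is a rough standalone sketch. It does not use
Tika at all: SketchDetector, CompositeSketch, the PARENT table and the
"application/x-mycustom+xml" type are hypothetical names invented for the
illustration, and the real CompositeDetector may decide specificity
differently.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a detector; NOT Tika's Detector interface.
interface SketchDetector {
    // Returns a media type name, or UNKNOWN when nothing matches.
    String detect(String name, byte[] prefix);
}

final class CompositeSketch implements SketchDetector {
    static final String UNKNOWN = "application/octet-stream";

    // Assumed type hierarchy (child -> parent), made up for this sketch.
    private static final Map<String, String> PARENT = new HashMap<>();
    static {
        PARENT.put("text/plain", UNKNOWN);
        PARENT.put("application/xml", "text/plain");
        PARENT.put("application/x-mycustom+xml", "application/xml");
    }

    private final List<SketchDetector> detectors;

    CompositeSketch(List<SketchDetector> detectors) {
        this.detectors = detectors;
    }

    // True if 'ancestor' appears somewhere above 'type' in PARENT.
    static boolean isSpecializationOf(String type, String ancestor) {
        String p = PARENT.get(type);
        while (p != null) {
            if (p.equals(ancestor)) {
                return true;
            }
            p = PARENT.get(p);
        }
        return false;
    }

    @Override
    public String detect(String name, byte[] prefix) {
        String best = UNKNOWN;
        // Ask every component in order; keep a result only when it
        // refines (is a specialization of) the best type seen so far.
        for (SketchDetector d : detectors) {
            String candidate = d.detect(name, prefix);
            if (isSpecializationOf(candidate, best)) {
                best = candidate;
            }
        }
        return best;
    }

    // Name-based heuristic: looks only at the file extension.
    static SketchDetector byName() {
        return (name, prefix) ->
            name != null && name.endsWith(".xml") ? "application/xml" : UNKNOWN;
    }

    // Content-based heuristic: looks only at the leading bytes.
    static SketchDetector byMagic() {
        return (name, prefix) -> {
            String head = new String(prefix, StandardCharsets.US_ASCII);
            return head.startsWith("<mycustom")
                ? "application/x-mycustom+xml" : UNKNOWN;
        };
    }
}
```

With both components combined, a file named a.xml whose content starts with
<mycustom comes out as the custom type regardless of component order, while
any other a.xml falls back to application/xml.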

ok - i'll make a start at writing up some documentation

should i add it to the bottom of
http://lucene.apache.org/tika/documentation.html or would a separate
document be better?

- robert
