Re: Extensible content type detection

Niall Pemberton Mon, 19 Jan 2009 12:47:36 -0800

On Sat, Jan 17, 2009 at 10:57 PM, Jukka Zitting <jukka.zitt...@gmail.com> wrote:
> Hi,
>
> I've been thinking about how we currently do content type detection in
> Tika and how we could improve things by making the type detection code
> more modular and easier to extend. See TIKA-95 for some background.
>
> I now think I have a pretty good idea on how to do this. See below for
> a proposed Detector interface that's based on similar ideas as the
> Parser interface that's worked really well for us. I would have
> separate Detector classes for all the kinds of type detection
> mechanisms we have (resource name, content type hint, magic bytes) and
> may come up with in the future. In addition we'd have something like a
> CompositeDetector class that delegates the detection task to
> configured individual detectors and selects the most specific
> resulting media type as the result of the whole type detection
> process.
>
> WDYT?


What about using a (read-only?) ByteBuffer[1] rather than InputStream
to avoid the issue of implementations doing things with the
InputStream that they shouldn't?

[1] http://java.sun.com/j2se/1.5.0/docs/api/java/nio/class-use/ByteBuffer.html
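
A minimal sketch of the idea, assuming a hypothetical BufferDetector
variant of the proposed interface. The names BufferDetector,
PdfMagicDetector and BufferDetectorDemo are made up for illustration, and
String and Map stand in for Tika's MimeType and Metadata so that the
example compiles on its own against a current JDK:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Map;

// Hypothetical variant of the proposed Detector interface (not Tika API).
// String and Map are stand-ins for MimeType and Metadata.
interface BufferDetector {
    String detect(ByteBuffer prefix, Map<String, String> metadata);
}

// Illustrative magic-byte detector: PDF files start with "%PDF-".
class PdfMagicDetector implements BufferDetector {
    private static final byte[] MAGIC =
            "%PDF-".getBytes(StandardCharsets.US_ASCII);

    public String detect(ByteBuffer prefix, Map<String, String> metadata) {
        if (prefix == null || prefix.remaining() < MAGIC.length) {
            return "application/octet-stream";
        }
        for (int i = 0; i < MAGIC.length; i++) {
            // Absolute get(): reads without moving the buffer position.
            if (prefix.get(prefix.position() + i) != MAGIC[i]) {
                return "application/octet-stream";
            }
        }
        return "application/pdf";
    }
}

class BufferDetectorDemo {
    static String detectPrefix(byte[] head, BufferDetector detector) {
        // asReadOnlyBuffer() shares the bytes but rejects put() calls, and
        // the view has its own position and limit, so a detector cannot
        // disturb the caller's copy of the data.
        ByteBuffer view = ByteBuffer.wrap(head).asReadOnlyBuffer();
        return detector.detect(view, Collections.<String, String>emptyMap());
    }
}
```

The caller decides how many bytes go into the buffer, so the bound on how
much a detector can consume is enforced by construction rather than by
the Javadoc contract.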

Niall

> BR,
>
> Jukka Zitting
>
>
> package org.apache.tika.detect;
>
> import java.io.IOException;
> import java.io.InputStream;
>
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.mime.MimeType;
>
> /**
>  * Content type detector. Implementations of this interface use various
>  * heuristics to detect the content type of a document based on given
>  * input metadata or the first few bytes of the document stream.
>  *
>  * @since Apache Tika 0.3
>  */
> public interface Detector {
>
>    /**
>     * Detects the content type of the given input document. Returns
>     * <code>application/octet-stream</code> if the type of the document
>     * can not be detected.
>     * <p>
>     * If the document input stream is not available, then the first
>     * argument may be <code>null</code>. Otherwise the detector may
>     * read a bounded number of bytes from the start of the stream to help
>     * in type detection. The stream must not be closed or otherwise
>     * manipulated other than by simply reading bytes from it, as the caller
>     * may use the mark feature to be able to reset the stream to the
>     * beginning for proper parsing when the content type is detected.
>     * For the same reason the detector must only read up to a limited
>     * number of bytes from the stream to avoid potentially unbounded
>     * memory use for the buffer of a marked stream.
>     * <p>
>     * The given input metadata is only read, not modified, by the detector.
>     *
>     * @param input document input stream, or <code>null</code>
>     * @param metadata input metadata for the document
>     * @return detected media type, or <code>application/octet-stream</code>
>     * @throws IOException if the document input stream could not be read
>     */
>    MimeType detect(InputStream input, Metadata metadata) throws IOException;
>
> }
>
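
For comparison, the caller side of the mark/reset contract described in
the quoted Javadoc could look roughly like the sketch below. MarkResetDemo,
detectZip and the MAX_PREFIX value are assumptions made for illustration
and are not part of the proposal; detectZip stands in for a
Detector.detect() implementation and returns a String instead of a
MimeType:

```java
import java.io.IOException;
import java.io.InputStream;

class MarkResetDemo {
    // Assumed upper bound on how much a detector may read (hypothetical).
    static final int MAX_PREFIX = 8192;

    // Stand-in for a detector: ZIP archives start with the bytes 'P' 'K'.
    static String detectZip(InputStream input) throws IOException {
        if (input == null) {
            return "application/octet-stream";
        }
        int b0 = input.read();
        int b1 = input.read();
        return (b0 == 'P' && b1 == 'K')
                ? "application/zip"
                : "application/octet-stream";
    }

    // Marks the stream, lets the detector read, then rewinds so that a
    // parser can later consume the document from the first byte.
    static String detectThenRewind(InputStream stream) throws IOException {
        if (!stream.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        stream.mark(MAX_PREFIX);
        try {
            return detectZip(stream);
        } finally {
            // reset() only works if the detector read at most MAX_PREFIX bytes.
            stream.reset();
        }
    }
}
```

A caller would typically wrap the raw stream in a BufferedInputStream
before calling detectThenRewind, since not every InputStream supports
mark/reset.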
