On Sat, Jan 17, 2009 at 10:57 PM, Jukka Zitting <jukka.zitt...@gmail.com> wrote:
> Hi,
>
> I've been thinking about how we currently do content type detection in
> Tika and how we could improve things by making the type detection code
> more modular and easier to extend. See TIKA-95 for some background.
>
> I now think I have a pretty good idea on how to do this. See below for
> a proposed Detector interface that's based on similar ideas as the
> Parser interface that's worked really well for us. I would have
> separate Detector classes for all the kinds of type detection
> mechanisms we have (resource name, content type hint, magic bytes) and
> may come up with in the future. In addition we'd have something like a
> CompositeDetector class that delegates the detection task to
> configured individual detectors and selects the most specific
> resulting media type as the result of the whole type detection
> process.
>
> WDYT?

What about using a (read-only?) ByteBuffer[1] rather than InputStream to
avoid the issue of implementations doing things with the InputStream
that they shouldn't?

[1] http://java.sun.com/j2se/1.5.0/docs/api/java/nio/class-use/ByteBuffer.html

Niall

> BR,
>
> Jukka Zitting
>
>
> package org.apache.tika.detect;
>
> import java.io.IOException;
> import java.io.InputStream;
>
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.mime.MimeType;
>
> /**
>  * Content type detector. Implementations of this interface use various
>  * heuristics to detect the content type of a document based on given
>  * input metadata or the first few bytes of the document stream.
>  *
>  * @since Apache Tika 0.3
>  */
> public interface Detector {
>
>     /**
>      * Detects the content type of the given input document. Returns
>      * <code>application/octet-stream</code> if the type of the document
>      * cannot be detected.
>      * <p>
>      * If the document input stream is not available, then the first
>      * argument may be <code>null</code>. Otherwise the detector may
>      * read a bounded number of bytes from the start of the stream to help
>      * in type detection. The stream must not be closed or otherwise
>      * manipulated other than by simply reading bytes from it, as the caller
>      * may use the mark feature to be able to reset the stream to the
>      * beginning for proper parsing when the content type is detected.
>      * For the same reason the detector must only read up to a limited
>      * number of bytes from the stream to avoid potentially unbounded
>      * memory use for the buffer of a marked stream.
>      * <p>
>      * The given input metadata is only read, not modified, by the detector.
>      *
>      * @param input document input stream, or <code>null</code>
>      * @param metadata input metadata for the document
>      * @return detected media type, or <code>application/octet-stream</code>
>      * @throws IOException if the document input stream could not be read
>      */
>     MimeType detect(InputStream input, Metadata metadata) throws IOException;
>
> }
>
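
To make the read-only ByteBuffer idea a bit more concrete, here is a rough sketch of how a caller could hand detectors a bounded prefix instead of the stream itself. Everything in it is an assumption for illustration and not part of Tika or the proposal above: the BufferDetector interface, the readPrefix helper, the PDF_MAGIC check and the 8-byte limit are all made-up names and values, and it returns a plain String where the proposal returns a MimeType.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class ByteBufferDetectSketch {

    /** Hypothetical buffer-based variant of the proposed Detector; not a Tika API. */
    interface BufferDetector {
        String detect(ByteBuffer prefix);
    }

    /** Illustrative magic-byte check: looks for the "%PDF-" signature. */
    static final BufferDetector PDF_MAGIC = prefix -> {
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (prefix.remaining() < magic.length) {
            return "application/octet-stream";
        }
        for (int i = 0; i < magic.length; i++) {
            // Absolute get: does not move the buffer position.
            if (prefix.get(prefix.position() + i) != magic[i]) {
                return "application/octet-stream";
            }
        }
        return "application/pdf";
    };

    /**
     * Reads at most 'limit' bytes from the start of the stream, resets the
     * stream to where it was, and returns the bytes as a read-only buffer.
     */
    static ByteBuffer readPrefix(InputStream in, int limit) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(limit);
        byte[] buf = new byte[limit];
        int n = 0;
        try {
            while (n < limit) {
                int r = in.read(buf, n, limit - n);
                if (r < 0) {
                    break; // end of stream before the limit was reached
                }
                n += r;
            }
        } finally {
            in.reset(); // the caller can still parse from the beginning
        }
        return ByteBuffer.wrap(buf, 0, n).asReadOnlyBuffer();
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(
                "%PDF-1.4 example".getBytes(StandardCharsets.US_ASCII));
        ByteBuffer prefix = readPrefix(in, 8);
        System.out.println(PDF_MAGIC.detect(prefix));
    }
}
```

With something like this the mark/reset handling and the byte limit live in one place on the caller's side, and a detector has no way to close or over-read the stream. The cost is that the prefix size has to be chosen up front rather than by each detector.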