Jukka Zitting wrote:
Hi,

I've been thinking about how we currently do content type detection in
Tika and how we could improve things by making the type detection code
more modular and easier to extend. See TIKA-95 for some background.

I now think I have a pretty good idea on how to do this. See below for
a proposed Detector interface that's based on similar ideas as the
Parser interface that's worked really well for us. I would have
separate Detector classes for all the kinds of type detection
mechanisms we have (resource name, content type hint, magic bytes) and
may come up with in the future. In addition we'd have something like a
CompositeDetector class that delegates the detection task to
configured individual detectors and selects the most specific
resulting media type as the result of the whole type detection
process.
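
To illustrate the delegation idea, here is a rough sketch. It is only
an illustration under assumed names: the Detector signature is a
simplification and not the interface proposed in TIKA-95, and the
"specificity" ranking is a crude stand-in for a real media type
hierarchy.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Assumed, simplified interface; not the actual proposed Tika Detector.
interface Detector {
    String UNKNOWN = "application/octet-stream";

    // Returns a media type name, or UNKNOWN if this detector cannot tell.
    String detect(String resourceName, byte[] prefix);
}

// Guesses the type from the resource name extension only.
class NameDetector implements Detector {
    public String detect(String resourceName, byte[] prefix) {
        if (resourceName == null) return UNKNOWN;
        String name = resourceName.toLowerCase(Locale.ROOT);
        if (name.endsWith(".pdf")) return "application/pdf";
        if (name.endsWith(".txt")) return "text/plain";
        return UNKNOWN;
    }
}

// Guesses the type from the leading bytes only (just the PDF signature here).
class MagicDetector implements Detector {
    public String detect(String resourceName, byte[] prefix) {
        if (prefix != null && prefix.length >= 4
                && "%PDF".equals(new String(prefix, 0, 4, StandardCharsets.US_ASCII))) {
            return "application/pdf";
        }
        return UNKNOWN;
    }
}

// Delegates to the configured detectors and keeps the most specific answer.
class CompositeDetector implements Detector {
    private final List<Detector> detectors;

    CompositeDetector(List<Detector> detectors) {
        this.detectors = detectors;
    }

    // Hypothetical ranking: unknown < text/plain < anything else.
    private static int specificity(String type) {
        if (UNKNOWN.equals(type)) return 0;
        if ("text/plain".equals(type)) return 1;
        return 2;
    }

    public String detect(String resourceName, byte[] prefix) {
        String best = UNKNOWN;
        for (Detector d : detectors) {
            String candidate = d.detect(resourceName, prefix);
            // Strictly greater, so the earlier detector wins ties.
            if (specificity(candidate) > specificity(best)) best = candidate;
        }
        return best;
    }
}
```

With a name detector and a magic detector configured, a resource
called "report.txt" whose content starts with "%PDF" would be reported
as application/pdf, since that is more specific than text/plain.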

WDYT?

I like the idea. It allows us to use different strategies for detecting the type of individual formats, or to change the whole strategy used. The only thing I am wondering is whether we should introduce some kind of confidence level for the guesses, perhaps as part of the metadata?
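
A rough sketch of what a confidence level could look like. The class names and the 0.0 to 1.0 scale are made up for illustration and are not existing Tika code; the confidence could just as well travel in the metadata instead of in a separate result object.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical: a detected media type paired with a confidence in [0.0, 1.0].
final class TypeGuess {
    final String mediaType;
    final double confidence;

    TypeGuess(String mediaType, double confidence) {
        this.mediaType = mediaType;
        this.confidence = confidence;
    }
}

// Hypothetical helper that picks one guess out of several.
final class GuessSelector {
    // Returns the guess with the highest confidence; the earliest guess
    // wins ties, and null is returned for an empty list.
    static TypeGuess best(List<TypeGuess> guesses) {
        TypeGuess best = null;
        for (TypeGuess g : guesses) {
            if (best == null || g.confidence > best.confidence) best = g;
        }
        return best;
    }
}
```

A composite detector could then collect one TypeGuess per delegate and return GuessSelector.best(...), instead of relying on type specificity alone.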

--
Sami Siren

