Re: Extensible content type detection

Sami Siren Sun, 18 Jan 2009 22:25:47 -0800

Jukka Zitting wrote:

Hi,


I've been thinking about how we currently do content type detection in
Tika and how we could improve things by making the type detection code
more modular and easier to extend. See TIKA-95 for some background.

I now think I have a pretty good idea on how to do this. See below for
a proposed Detector interface that's based on similar ideas as the
Parser interface that's worked really well for us. I would have
separate Detector classes for all the kinds of type detection
mechanisms we have (resource name, content type hint, magic bytes) and
may come up with in the future. In addition we'd have something like a
CompositeDetector class that delegates the detection task to
configured individual detectors and selects the most specific
resulting media type as the result of the whole type detection
process.

WDYT?

I like the idea; it allows us to use different strategies for detecting the type for individual formats, or to change the whole strategy used. The only thing I am wondering is whether we should introduce some kind of confidence level for the guesses, perhaps as part of the metadata?
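
To make the confidence idea a bit more concrete, here is a rough sketch of how a composite detector could pick among guesses. All of the names here (Guess, GuessingDetector, NameDetector, MagicDetector, ConfidenceCompositeDetector) and the confidence values are made up for illustration; this is not the Detector interface from the proposal, and it carries the confidence in a small result object rather than in the metadata.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical result type: a media type name plus a confidence in [0.0, 1.0].
final class Guess {
    final String mediaType;
    final double confidence;

    Guess(String mediaType, double confidence) {
        this.mediaType = mediaType;
        this.confidence = confidence;
    }
}

// Hypothetical stand-in for a detector; the signature in the actual
// proposal may well be different.
interface GuessingDetector {
    Guess detect(byte[] prefix, String resourceName);
}

// Guesses from the resource name only, so the confidence is kept low.
final class NameDetector implements GuessingDetector {
    public Guess detect(byte[] prefix, String resourceName) {
        if (resourceName != null && resourceName.toLowerCase().endsWith(".pdf")) {
            return new Guess("application/pdf", 0.5); // assumed value
        }
        return new Guess("application/octet-stream", 0.0);
    }
}

// Guesses from magic bytes at the start of the content; more trustworthy
// than a file name, so the confidence is higher.
final class MagicDetector implements GuessingDetector {
    private static final byte[] PDF_MAGIC = {'%', 'P', 'D', 'F', '-'};

    public Guess detect(byte[] prefix, String resourceName) {
        if (prefix != null && prefix.length >= PDF_MAGIC.length
                && Arrays.equals(Arrays.copyOf(prefix, PDF_MAGIC.length), PDF_MAGIC)) {
            return new Guess("application/pdf", 0.9); // assumed value
        }
        return new Guess("application/octet-stream", 0.0);
    }
}

// Delegates to the configured detectors and keeps the guess with the
// highest confidence; falls back to application/octet-stream.
final class ConfidenceCompositeDetector implements GuessingDetector {
    private final List<GuessingDetector> detectors;

    ConfidenceCompositeDetector(List<GuessingDetector> detectors) {
        this.detectors = detectors;
    }

    public Guess detect(byte[] prefix, String resourceName) {
        Guess best = new Guess("application/octet-stream", 0.0);
        for (GuessingDetector detector : detectors) {
            Guess guess = detector.detect(prefix, resourceName);
            if (guess.confidence > best.confidence) {
                best = guess;
            }
        }
        return best;
    }
}
```

With something like this, a file named report.txt that starts with the PDF magic bytes would still come out as application/pdf, because the magic-byte guess outranks the name-based one. Selecting by confidence is only one option; it could also be combined with the "most specific type wins" rule from the proposal.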


--
Sami Siren
