Re: Can I add custom detector to be called last to parse common containers' subtypes?

Mikhail Titov Mon, 24 Aug 2015 16:15:06 -0700

On Mon, Aug 24, 2015 at 4:19 PM, Nick Burch wrote:
> Currently, Tika runs each detector independently, in priority order,
> and allows one detector to "improve" (specialise) the results of a
> previous one.
>
> I'm not sure why you want to pass the results of one detector to
> another - any chance you could clarify the use-case for that? Either
> in text, or in code, whatever's easiest!


While writing a reply, I came to the conclusion that in my particular
case I can move all the "detection" into parser code and wrap the
standard parsers. Nothing seems to prevent me from changing the content
type in the metadata from parser code if I really want to. I guess it
was the subtle difference between detection and parsing that confused
me initially.

I can't think of any other, more general use case for such "nesting". I
thought it could be used to modularize things, e.g. to keep a detector
for some other "zipped-up" format outside of ZipContainerDetector, but
that would require passing lots of other things around to reduce
overhead :( Below is my original reasoning for why I wanted "nested"
detection.


The overall goal I have is to recognize subsets of a few standard types,
such as certain CSV files (regardless of file extension) and Excel
files, based on their content, e.g. certain locally standardized cell
values.

To achieve that with minimal effort, it would be nice to already know
that a given file is, e.g., an Excel file, and then just open it and
check a few extra things. As of now, I'm not certain how to deliver the
results after that, i.e. as extra attributes or as a descendant type.

I was going to use a separate custom MIME type and a separate
detector. If that detector gets no information about what type the file
is suspected to be so far, then I'd have to repeat a number of steps
that the default detector has probably already done, like checking
whether the file is a ZIP container and whether it is an Excel file.
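
To make the idea a bit more concrete, here is a rough sketch in plain
Java of what I mean by refining an already detected type. It is not
Tika API: the class SubtypeRefiner, the method refine, the MIME type
text/x-lab-results+csv and the expected header row are all hypothetical
names made up for illustration. It only covers the CSV case and assumes
the base type arrives as a plain string.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: none of these names come from Tika.
class SubtypeRefiner {
    // Made-up custom subtype and the header row assumed to identify it.
    static final String LAB_CSV = "text/x-lab-results+csv";
    static final List<String> EXPECTED_HEADER =
            Arrays.asList("sample_id", "analyte", "value");

    /**
     * Refines a base type that an earlier detection step already produced.
     * Returns the base type unchanged when the extra check does not apply
     * or does not match, so the generic result is never lost.
     */
    static String refine(String baseType, String content) {
        if (!"text/csv".equals(baseType)) {
            return baseType; // only CSV is handled in this sketch
        }
        // Look at the first line only; the rest of the file is not read.
        int newline = content.indexOf('\n');
        String firstLine = newline < 0 ? content : content.substring(0, newline);
        List<String> header = Arrays.asList(
                firstLine.trim().toLowerCase(Locale.ROOT).split("\\s*,\\s*"));
        return header.equals(EXPECTED_HEADER) ? LAB_CSV : baseType;
    }
}
```

The point is that refine never has to work out again whether the input
is CSV at all; it starts from the earlier result and only adds the
cheap, local check on top.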

-- 
Mikhail
