On Mon, Aug 24, 2015 at 4:19 PM, Nick Burch wrote:
> Currently, Tika runs each detector independently, in priority order,
> and allows one detector to "improve" (specialise) the results of a
> previous one.
>
> I'm not sure why you want to pass the results of one detector to
> another - any chance you could clarify the use-case for that? Either
> in text, or in code, whatever's easiest!

While writing a reply, I came to the conclusion that in my particular case I can move all the "detection" into parser code and wrap the standard parsers. As far as I can tell, nothing prevents me from changing the content type in the metadata from parser code if I really want to. I guess it was the subtle difference between detection and parsing that confused me initially.

I can't think of another, more general use case for such "nesting". I thought it could be used to modularize things, e.g. a detector for some other "zipped-up" format living outside of ZipContainerDetector, but that would require passing lots of other things around to reduce overhead :(

Below is my original reasoning for wanting "nested" detection.

The overall goal is to recognize subsets of a few standard types, such as certain CSV files (regardless of file extension) and Excel files, based on their content, e.g. certain locally standardized cell values. To achieve that with minimal effort, it would be nice to already know that a given file is, e.g., an Excel file, and then just open it and check a few extra things. As of now, I'm not sure how to deliver the results after that, i.e. as extra attributes or as a descendant type. I was going to use a separate custom MIME type and a separate detector.

If there is no information about whether the file being detected is already suspected to be of a certain type, then I'd have to repeat a number of steps that the default detector has probably already done, like checking whether the file is a ZIP container and whether it is an Excel file.

--
Mikhail
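
P.S. A rough sketch of the "wrap a standard parser and adjust the content type" idea, in case code is easier to follow than text. It deliberately does not use the real Tika classes: `SimpleParser`, `PlainCsvParser`, `SpecialisingCsvParser` and the MIME type `text/x-example-report+csv` are all made-up stand-ins, and the metadata is just a `Map`. The real thing would wrap an actual Tika parser instead.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a parser interface; NOT Tika's actual Parser API.
interface SimpleParser {
    void parse(byte[] content, Map<String, String> metadata);
}

// Stand-in for a "standard" parser: it only records the generic type.
class PlainCsvParser implements SimpleParser {
    @Override
    public void parse(byte[] content, Map<String, String> metadata) {
        metadata.put("Content-Type", "text/csv");
    }
}

// Wraps a standard parser and narrows the content type when the first
// line matches a locally standardised header.
class SpecialisingCsvParser implements SimpleParser {
    // Made-up custom MIME type, used for illustration only.
    static final String CUSTOM_TYPE = "text/x-example-report+csv";

    private final SimpleParser delegate;
    private final String expectedHeader;

    SpecialisingCsvParser(SimpleParser delegate, String expectedHeader) {
        this.delegate = delegate;
        this.expectedHeader = expectedHeader;
    }

    @Override
    public void parse(byte[] content, Map<String, String> metadata) {
        // Let the wrapped parser do the normal work first.
        delegate.parse(content, metadata);

        // Then check the extra, locally defined condition on the content.
        String text = new String(content, StandardCharsets.UTF_8);
        String firstLine = text.split("\\R", 2)[0].trim();
        if (firstLine.equals(expectedHeader)) {
            // Override the generic type with the more specific one.
            metadata.put("Content-Type", CUSTOM_TYPE);
        }
    }
}
```

With this, parsing a file whose first line is the agreed header (say `id,amount,currency`) leaves the custom type in the metadata, while any other CSV keeps `text/csv`. The same shape should work for Excel, with the check looking at particular cell values instead of the first line.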
