On Mon, 24 Aug 2015, Mikhail Titov wrote:
> While writing a reply, I came to the conclusion that in my particular case I can move all the "detection" into parser code and wrap the standard parsers. I feel like nothing prevents me from changing the content type in the metadata from parser code if I really want to. I guess it was that subtle difference between detection and parsing that confused me initially.

Parsing needs Detection. Detection can be used standalone. Detection is "what is it", Parsing is "what does it contain".

I guess it comes down to whether you think these special spreadsheets logically count as a distinct file type or not.

> I can't think of another general use case for such "nesting". I thought
> it could be used to modularize things, like some other "zipped-up"
> format detector outside of ZipContainerDetector, but it would require
> passing lots of other things around to reduce overhead :( Below is my
> original reasoning for why I wanted "nested" detection.

TikaInputStream supports an Open Container, which allows some of the work done in detection to be re-used by parsing, but that's about as far as we go.

> The overall goal I have is to recognize subsets of a few standard types,
> like certain CSV (regardless of file extension) and Excel files, based on
> their content, e.g. certain locally standardized cell values.

Given that you'd probably have to do 80% of the parsing work to spot what's in particular cells, I don't think that's a detection thing. If possible, I'd suggest doing it with a content handler. Have the content handler check for the cells, then re-format the output / pull out additional bits / tweak bits as needed from there.

There's one example linked in the examples of doing something a little like that:
http://tika.apache.org/1.10/examples.html#Custom_Content_Handlers
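
As a rough sketch of the content handler idea (not the code from the examples page), something like the following could collect cell text and check the first cell against a marker. It only uses the JDK's SAX classes. The class name, the marker value, and the "first cell" rule are all made up for illustration, and it assumes the parser emits cells as <td> elements in its XHTML output, which you'd want to verify for your formats.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: flags a document whose first table cell equals a known marker. */
class CellMarkerHandler extends DefaultHandler {
    private final String marker;                          // assumed "locally standardized" cell value
    private final List<String> cells = new ArrayList<>(); // text of every <td> seen so far
    private StringBuilder current;                        // non-null only while inside a <td>

    CellMarkerHandler(String marker) {
        this.marker = marker;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("td".equals(name(localName, qName))) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current != null && "td".equals(name(localName, qName))) {
            cells.add(current.toString().trim());
            current = null;
        }
    }

    // Namespace-aware sources fill in localName; others only provide qName.
    private static String name(String localName, String qName) {
        return (localName == null || localName.isEmpty()) ? qName : localName;
    }

    /** True if the first cell seen matches the marker. */
    boolean isSpecial() {
        return !cells.isEmpty() && marker.equals(cells.get(0));
    }

    List<String> cells() {
        return cells;
    }

    /** Convenience for trying the handler on an XHTML string without any parser library. */
    static CellMarkerHandler scan(String xhtml, String marker) {
        CellMarkerHandler handler = new CellMarkerHandler(marker);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not read XHTML", e);
        }
        return handler;
    }
}
```

With Tika you'd pass an instance as the ContentHandler argument of Parser.parse(stream, handler, metadata, context) instead of calling scan(), then look at isSpecial() afterwards and tweak the metadata or output as needed.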

Thanks
Nick
