Re: Can I add custom detector to be called last to parse common containers' subtypes?

Nick Burch Tue, 25 Aug 2015 03:00:07 -0700

On Mon, 24 Aug 2015, Mikhail Titov wrote:

While writing a reply, I came to the conclusion that in my particular case I can move all "detection" into parser code and wrap the standard parsers. I feel like nothing prevents me from changing the content type in the metadata from parser code if I really want to. I guess it is the subtle difference between detection and parsing that confused me initially.

Parsing needs Detection. Detection can be used standalone. Detection is "what is it", Parsing is "what does it contain".

I guess it comes down to whether you think these special spreadsheets logically count as a distinct file type or not.

I can't think of another general use case for such "nesting". I thought
it could be used to modularize things, like some other "zipped-up"
format detector outside of ZipContainerDetector, but it would require
passing lots of other things around to reduce overhead :( Below is my
original reasoning for why I wanted "nested" detection.

TikaInputStream supports an Open Container, which allows some of the work done in detection to be re-used by parsing, but that's about as far as we go.

The overall goal I have is to recognize subsets of a few standard types,
like certain CSV (regardless of file extension) and Excel files, based on
their content, e.g. certain locally standardized cell values.

Given that you'd probably have to have done 80% of the parsing work to spot what's in particular cells, I don't think that's a detection thing. If possible, I'd suggest doing it with a content handler. Have the content handler check for the cells, and re-format it / pull out additional bits / tweak bits as needed from there.

There's one example linked in the examples of doing something a little like that:

http://tika.apache.org/1.10/examples.html#Custom_Content_Handlers
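
As a rough illustration of the content-handler idea, here is a minimal sketch that uses only the JDK's SAX classes. It assumes the parser reports spreadsheet rows and cells as `tr` / `td` XHTML elements, and it checks whether the first row matches a list of expected values. The class name `MarkerCellHandler`, the `matches()` method and the expected values are all hypothetical, made up for this example; they are not part of Tika.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/**
 * Hypothetical handler: records the text of the first table row it sees
 * and reports whether it equals a list of expected cell values.
 * Assumes rows and cells arrive as "tr" / "td" SAX elements.
 */
class MarkerCellHandler extends DefaultHandler {
    private final List<String> expected;                 // assumed "standard" first-row values
    private final List<String> firstRow = new ArrayList<>();
    private final StringBuilder cell = new StringBuilder();
    private boolean inCell = false;
    private boolean firstRowDone = false;

    MarkerCellHandler(List<String> expected) {
        this.expected = expected;
    }

    // Namespace-aware sources fill in localName; others may only set qName.
    private static String name(String localName, String qName) {
        return (localName == null || localName.isEmpty()) ? qName : localName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (!firstRowDone && "td".equals(name(localName, qName))) {
            inCell = true;
            cell.setLength(0);                           // start collecting a new cell
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inCell) {
            cell.append(ch, start, length);              // text may arrive in several chunks
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String n = name(localName, qName);
        if (inCell && "td".equals(n)) {
            firstRow.add(cell.toString().trim());
            inCell = false;
        } else if ("tr".equals(n) && !firstRow.isEmpty()) {
            firstRowDone = true;                         // ignore all later rows
        }
    }

    /** True if the first row seen so far equals the expected values. */
    boolean matches() {
        return firstRow.equals(expected);
    }
}
```

An instance of such a handler would be passed as the ContentHandler argument of `Parser.parse(stream, handler, metadata, context)`, and `matches()` consulted once parsing has finished. Whether the first row is the right place to look is itself an assumption; the same pattern applies to any cell position.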

Thanks
Nick
