On Tue, Aug 25, 2015 at 5:07 AM, Nick Burch wrote:
> On Mon, 24 Aug 2015, Mikhail Titov wrote:
>> On Mon, Aug 24, 2015 at 6:14 PM, Mikhail Titov wrote:
>>> While writing a reply, I came to a conclusion that in my particular case
>>> I can move all "detection" into a parser code and wrap standard parsers.
>>
>> Is parser decorator the way to go if I want to dig few more things on
>> top of existing parser output?
>
> ContentHandler is another one

I can confirm that a parser decorator defined via the config XML does work in
1.10, and it solved my problem: I can now discern a specific subset of Excel
files. ContentHandler isn't quite suitable for my needs, as I expect to find
things at certain cells in the original document rather than in a serialized
stream of SAX events.

>> P.S. I'm using Tika 1.9 at this moment.
>
> That's probably part of your issue. Please retry with 1.10, as we did
> quite a bit of work on tika config for parsers in that release

I initially refrained from upgrading to 1.10 because it broke other unit
tests I had. I didn't chime in before the release because, being new to
Tika, I wasn't sure whether I was doing something wrong.

As it turned out, with 1.10 I have to explicitly declare the default parser
(or a specific one, if services aren't used) for a new MIME type in the
config XML if and only if that file also contains a parser decorator
wrapping the default parser. With 1.9, it was enough to add a class that
supports the given MIME type (via getSupportedTypes) to
META-INF/services/org.apache.tika.parser.Parser. I feel like it might be
worth mentioning.

Here is an example. I have a parser supporting text/toa5 listed in services.

,----[ custom mime xml ]
| <mime-type type="text/toa5">
|   <_comment>Campbell Scientific table oriented ascii file</_comment>
|   <magic priority="50">
|     <match value="TOA5" type="string" offset="1" />
|   </magic>
|   <glob pattern="*.dat"/>
|   <sub-class-of type="text/csv"/>
| </mime-type>
`----

The following will break automatic parser calling for text/toa5 in 1.10 but
not in 1.9:

,----[ tika config xml ]
| <?xml version="1.0" encoding="UTF-8"?>
| <properties>
|   <parsers>
|     <!-- <parser class="org.apache.tika.parser.DefaultParser"> -->
|     <!--   <mime>text/toa5</mime> -->
|     <!-- </parser> -->
|     <parser class="my.tika.parser.ExcelParser">
|       <parser class="org.apache.tika.parser.DefaultParser" />
|     </parser>
|   </parsers>
| </properties>
`----

With 1.10, those commented-out lines have to be uncommented for text/toa5
to be parsed. Is it a regression? I can't see anything exactly like that in
the 1.10 change log, though there are general notes that parser-related
things were redone.

-- 
Mikhail
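
For reference, here is a sketch of the config variant that the message above describes as working in 1.10: the same file with the DefaultParser declaration for text/toa5 uncommented. It assumes that my.tika.parser.ExcelParser is the poster's own decorator class (it is not part of Tika) and that a parser for text/toa5 is still listed in META-INF/services/org.apache.tika.parser.Parser. It has not been verified beyond what the message reports.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Explicit declaration that 1.10 reportedly needs once a decorator
         wrapping DefaultParser is present in the same file -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime>text/toa5</mime>
    </parser>
    <!-- Custom decorator (hypothetical class name taken from the example
         above) wrapping the default parser -->
    <parser class="my.tika.parser.ExcelParser">
      <parser class="org.apache.tika.parser.DefaultParser" />
    </parser>
  </parsers>
</properties>
```

The only difference from the failing config is the first <parser> element. The nested DefaultParser inside the decorator is unchanged.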
