On Tue, Aug 25, 2015 at 5:07 AM, Nick Burch wrote:
> On Mon, 24 Aug 2015, Mikhail Titov wrote:
>> On Mon, Aug 24, 2015 at 6:14 PM, Mikhail Titov wrote:
>>> While writing a reply, I came to a conclusion that in my particular case
>>> I can move all "detection" into a parser code and wrap standard parsers.
>>
>> Is parser decorator the way to go if I want to dig few more things on
>> top of existing parser output?
>
> ContentHandler is another one

I can confirm that a parser decorator defined via the config XML does
work in 1.10, and it solved my problem: I can now pick out a specific
subset of Excel files.

A ContentHandler isn't quite suitable for my needs, as I expect to find
things in certain cells of the original document rather than in a
serialized stream of SAX events.

>> P.S. I'm using Tika 1.9 at this moment.
>
> That's probably part of your issue. Please retry with 1.10, as we did
> quite a bit of work on tika config for parsers in that release

I initially held off on upgrading to 1.10 because it broke other unit
tests I had. I didn't chime in before the release because, being new to
Tika, I wasn't sure whether I was doing something wrong.

As it turned out, with 1.10 I have to explicitly declare the default
parser (or a specific one, if services aren't used) for a new MIME type
in the config XML if and only if that file also contains a parser
decorator wrapping the default parser. With 1.9, it was enough to add a
class that supports the given MIME type (via getSupportedTypes) to
META-INF/services/org.apache.tika.parser.Parser. I feel this might be
worth mentioning.

Here is an example. I have a parser supporting text/toa5 listed in services.

,----[ custom mime xml ]
|   <mime-type type="text/toa5">
|      <_comment>Campbell Scientific table-oriented ASCII file</_comment>
|      <magic priority="50">
|         <match value="TOA5" type="string" offset="1" />
|      </magic>
|      <glob pattern="*.dat"/>
|      <sub-class-of type="text/csv"/>
|   </mime-type>
`----

The following config breaks automatic parser selection for text/toa5 in
1.10 but not in 1.9:

,----[ tika config xml ]
| <?xml version="1.0" encoding="UTF-8"?>
| <properties>
|       <parsers>
| <!--          <parser class="org.apache.tika.parser.DefaultParser"> -->
| <!--                  <mime>text/toa5</mime> -->
| <!--          </parser> -->
|               <parser class="my.tika.parser.ExcelParser">
|                       <parser class="org.apache.tika.parser.DefaultParser" />
|               </parser>
|       </parsers>
| </properties>
`----

1.10 requires the commented-out lines above to be uncommented, i.e. the
comment markers removed.
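
For clarity, this is the variant that works for me in 1.10: the same
file with the DefaultParser entry for text/toa5 active. It is only what
I observed with my setup, not necessarily the intended way to configure
this; my.tika.parser.ExcelParser is my own decorator class.

,----[ tika config xml, working in 1.10 ]
| <?xml version="1.0" encoding="UTF-8"?>
| <properties>
|       <parsers>
|               <!-- explicit declaration needed in 1.10 once a decorator is present -->
|               <parser class="org.apache.tika.parser.DefaultParser">
|                       <mime>text/toa5</mime>
|               </parser>
|               <!-- my decorator wrapping the default parser -->
|               <parser class="my.tika.parser.ExcelParser">
|                       <parser class="org.apache.tika.parser.DefaultParser" />
|               </parser>
|       </parsers>
| </properties>
`----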

Is this a regression? I can't see anything exactly like it in the 1.10
changelog, though there are general notes that parser-related things
were reworked.

-- 
Mikhail
