On Wed, Aug 26, 2015 at 6:11 AM, Nick Burch
<apache-5Jw25rjQhWFrovVCs/[email protected]> wrote:
> You probably shouldn't be defining additional mimetypes to
> DefaultParser.

My impression was indeed that no explicit definition should be needed
and that new types would be hooked up to the default parser
automatically via the service loader. My point is that with 1.10 this
is not enough, but only when the default parser is wrapped in the config.

> Give it child parsers that support those additional
> mimetypes. If there's no child parser registered for a given mimetype,
> then binding another mime type to DefaultParser won't help

That is another point of confusion. I thought that Tika somehow
enumerates parsers and registers their types on its own using
getSupportedTypes(), but apparently I have to be more explicit.

> You probably shouldn't be wrapping your own parser around
> DefaultParser in config. If you really need to do that, to decorate
> some how do it in code

I didn't investigate things properly in the beginning. It turns out I
have to use different POI classes to read the new and old Excel formats.
That was the incentive to piggyback on whatever the appropriate parser
is. Since that is not the case, I do have to be specific. I'm just
puzzled: if one had better be specific when decorating a parser, why
not simply derive from that parser instead of decorating it?

Long story short, I'm not wrapping it anymore.

> If you want Default Parser and your own one, do something like:
>
> <parsers>
>   <parser class="org.apache.tika.parser.DefaultParser" />
>   <parser class="my.tika.parser.ExcelParser">
>     <!-- any mimetypes special to this -->
>   </parser>
> </parsers>

I had this for a while until I realized that my parser (which extended
AbstractParser) does not get metadata from OOXMLParser this way. I'm
also confused about how this is supposed to be reconciled with
"Currently, it is only possible to have a single parser run against a
document"[1].

I do exclude the Excel types from the default parser and extend my
parser from OOXMLParser. This way I can piggyback on its metadata
extraction while discarding the content with a dummy handler (it is
mostly numbers, which would only pollute the search engine).
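
A rough sketch of what such a config might look like, assuming the
mime-exclude and mime elements described on the configuring page[1].
The class name my.tika.parser.ExcelParser is borrowed from the example
quoted above, and the single media type shown is only an illustration;
any other Excel types would need to be listed the same way in both
places.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Default parser, minus the type handed to the custom parser -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>application/vnd.openxmlformats-officedocument.spreadsheetml.sheet</mime-exclude>
    </parser>
    <!-- Hypothetical custom parser, assumed to extend OOXMLParser -->
    <parser class="my.tika.parser.ExcelParser">
      <mime>application/vnd.openxmlformats-officedocument.spreadsheetml.sheet</mime>
    </parser>
  </parsers>
</properties>
```

With the type excluded from DefaultParser, only one parser is left to
claim a given spreadsheet, which fits the single-parser-per-document
limitation quoted from [1].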

And to consolidate threads...

> Alfresco needs a very old version of ASM, so take care when upgrading Tika

Thanks for the heads-up. I just looked up what ASM is all about, and it
looks serious. Without the parser decorator, I am able to roll back and
use Tika 1.6. And it looks like[2] there might be a bump to at least 1.9.

Footnotes: 
[1]  https://tika.apache.org/1.10/configuring.html#Configuring_Parsers

[2]  https://issues.alfresco.com/jira/browse/ACE-4055

-- 
Mikhail
