On Wed, Aug 26, 2015 at 6:11 AM, Nick Burch wrote:
> You probably shouldn't be defining additional mimetypes to
> DefaultParser.

I was under the impression that there should indeed be no explicit
definition, and that new types get hooked up to the default parser
automatically via the service loader. My point, though, is that with
1.10 this is not enough if, and only if, the default parser is wrapped
in the config.

> Give it child parsers that support those additional
> mimetypes. If there's no child parser registered for a given mimetype,
> then binding another mime type to DefaultParser won't help

That is another point of confusion. I thought that Tika somehow
enumerated the parsers and registered them on its own using
getSupportedTypes(), but apparently I have to be more explicit.

> You probably shouldn't be wrapping your own parser around
> DefaultParser in config. If you really need to do that, to decorate
> some how do it in code

I didn't investigate things properly at the beginning. Apparently I
have to use different POI classes to read the new and the old Excel
formats. That was the incentive to piggyback on whatever the
appropriate parser is. Since that is not the case, I do have to be
specific. I'm just puzzled: if one had better be specific when
decorating a parser, why not simply derive from that parser instead of
decorating it? Long story short, I'm not wrapping it anymore.

> If you want Default Parser and your own one, do something like:
>
> <parsers>
>   <parser class="org.apache.tika.parser.DefaultParser" />
>   <parser class="my.tika.parser.ExcelParser">
>     <!-- any mimetypes special to this -->
>   </parser>
> </parsers>

I had this for a while, until I realized that my parser (extended from
AbstractParser) does not get the metadata from OOXMLParser this way.
Also, I'm confused about how this is supposed to reconcile with
"Currently, it is only possible to have a single parser run against a
document" [1].

I now exclude the Excel types from the default parser and extend my
parser from OOXMLParser. This way I can piggyback on the metadata
extraction while discarding the content with a dummy handler (it is
mostly numbers, which would only pollute the search engine).

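A config along the following lines would express the setup described
in the last paragraph. It is only a sketch, not a tested configuration:
it assumes the mime-exclude element described on the configuring
page [1], reuses the my.tika.parser.ExcelParser class name from the
quoted example, and the MIME type shown (the usual one for .xlsx) is an
assumption about which types are being excluded.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Default parser for everything except the excluded type
         (assumed here to be .xlsx only) -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>application/vnd.openxmlformats-officedocument.spreadsheetml.sheet</mime-exclude>
    </parser>
    <!-- Own parser (class name taken from the quoted example) bound
         explicitly to the excluded type -->
    <parser class="my.tika.parser.ExcelParser">
      <mime>application/vnd.openxmlformats-officedocument.spreadsheetml.sheet</mime>
    </parser>
  </parsers>
</properties>
```

With a layout like this, only one parser is registered per MIME type,
which would be consistent with the "single parser run against a
document" restriction quoted from [1].
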
And to consolidate threads...

> Alfresco needs a very old version of ASM, so take care when upgrading Tika

Thanks for the heads-up. I just looked up what ASM is all about. It
looks serious. Without the parser decorator, I am able to roll back and
use Tika 1.6. And it looks like [2] there might be a bump to at least
1.9.

Footnotes:
[1] https://tika.apache.org/1.10/configuring.html#Configuring_Parsers
[2] https://issues.alfresco.com/jira/browse/ACE-4055

--
Mikhail
