On 8 Oct 2010, at 13:42, Nick Burch wrote:

> On Fri, 8 Oct 2010, Jan Høydahl / Cominvent wrote:
>> Magic is most often great, but I generally prefer to have some way of
>> explicitly telling the software what to do :)
>
> That's very much available to you! See the different constructors to the
> AutoDetectParser for examples of how to control what detector is used, what
> parsers get used etc

I'm mainly concerned with the use case where you don't use Tika as a library but as part of a product like Solr or Nutch. You don't want to write a patch to Solr's use of Tika in e.g. DIH and ExtractingRequestHandler. Tika gives you the pluggability of new parsers simply by dropping a jar into the classpath, no matter which application Tika is embedded in. I'm looking for similar ways to configure Tika for advanced cases.

>> Now you discover that you prefer another parser for some of the formats
>> which the 3rd party plugin "hi-jacked". You can't modify their source code,
>> so how do you tell Tika this?
>
> At that point your uses are probably sufficiently different to the default
> that you shouldn't be using the no-argument AutoDetectParser constructor!
>
>> I propose an optional config file which, if found, overrides the mime types
>> specified - if the specified class is found and says it supports the mime
>> type of course.
>
> Or you could just have a regular Tika config file, and list in there only the
> parsers you're interested in using?

Ah, I thought the Tika config file was deprecated as of 0.7? How does Tika behave if you give it a tika-config.xml and at the same time give it a bunch of external parser plugin jars? Do you need to list all of them in tika-config? If you need to specify everything explicitly, perhaps a way to go is extending the XML format to allow a more compact override syntax, which does not nuke the auto-wired parsers.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
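
The override idea proposed above can be sketched roughly as follows. This is only an illustration in plain Java, not Tika code: `ParserOverrideSketch`, `resolve`, and the parser class names are hypothetical, and the rule that a later-discovered parser wins for a shared MIME type is an assumption made only for the sake of the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class ParserOverrideSketch {

    /**
     * Hypothetical resolution step.
     *
     * @param discovered parser class name -> MIME types it says it supports,
     *                   in the order the parsers were found on the classpath
     * @param overrides  MIME type -> parser class name the user prefers
     * @return MIME type -> parser class name that would handle it
     */
    public static Map<String, String> resolve(
            Map<String, Set<String>> discovered,
            Map<String, String> overrides) {
        Map<String, String> result = new LinkedHashMap<>();

        // Auto-wiring (assumed rule): a later parser replaces an earlier
        // one for any MIME type they both claim.
        for (Map.Entry<String, Set<String>> parser : discovered.entrySet()) {
            for (String type : parser.getValue()) {
                result.put(type, parser.getKey());
            }
        }

        // Overrides apply only if the named class was found and says it
        // supports the MIME type; otherwise the auto-wired choice stays.
        for (Map.Entry<String, String> override : overrides.entrySet()) {
            Set<String> supported = discovered.get(override.getValue());
            if (supported != null && supported.contains(override.getKey())) {
                result.put(override.getKey(), override.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> discovered = new LinkedHashMap<>();
        discovered.put("org.example.DefaultPdfParser",
                Set.of("application/pdf"));
        // Third-party plugin that "hi-jacks" PDF as well as its own format.
        discovered.put("com.thirdparty.HijackingParser",
                Set.of("application/pdf", "application/msword"));

        Map<String, String> overrides = new LinkedHashMap<>();
        overrides.put("application/pdf", "org.example.DefaultPdfParser");

        System.out.println(resolve(discovered, overrides));
    }
}
```

With this approach the override file only needs to name the MIME types where the default choice is wrong. Everything else stays auto-wired, and an override pointing at a missing class or an unsupported type is simply ignored.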
