On 8 Oct 2010, at 13:42, Nick Burch wrote:
> On Fri, 8 Oct 2010, Jan Høydahl / Cominvent wrote:
>> Magic is most often great, but I generally prefer to have some way of 
>> explicitly telling the software what to do :)
> 
> That's very much available to you! See the different constructors to the 
> AutoDetectParser for examples of how to control what detector is used, what 
> parsers get used etc

I'm mainly concerned with the use case where you don't use Tika as a library 
but as part of a product like Solr or Nutch. You don't want to write a patch to 
Solr's use of Tika in e.g. DIH and ExtractingRequestHandler. Tika gives you the 
pluggability of new parsers simply by dropping a jar onto the classpath, no 
matter which application Tika is embedded in. I'm looking for similar ways to 
configure Tika for advanced cases.

>> Now you discover that you prefer another parser for some of the formats 
>> which the 3rd party plugin "hi-jacked". You can't modify their source code, 
>> so how do you tell Tika this?
> 
> At that point your uses are probably sufficiently different to the default 
> that you shouldn't be using the no-argument AutoDetectParser constructor!
> 
>> I propose an optional config file which, if found, overrides the mime types 
>> specified - if the specified class is found and says it supports the mime 
>> type of course.
> 
> Or you could just have a regular Tika config file, and list in there only the 
> parsers you're interested in using?

Ah, I thought the Tika config file was deprecated as of 0.7? How does Tika behave 
if you give it a tika-config.xml and at the same time give it a bunch of 
external parser plugin jars? Do you need to list all of them in tika-config.xml? 
If you need to specify everything explicitly, perhaps a way to go is to extend 
the XML format to allow a more compact override syntax that does not nuke 
the auto-wired parsers.
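
To make the override idea concrete, here is a rough Java sketch of the 
behaviour I have in mind. It is not Tika's API: ParserInfo, resolve() and all 
parser class names below are made up for illustration, and it assumes that the 
last parser discovered wins a contested mime type. The auto-wired table is 
built first, and an override is applied only if the named class was found and 
says it supports that mime type.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ParserOverrideSketch {

    /** Hypothetical stand-in for a parser: its class name and the mime types it claims. */
    static final class ParserInfo {
        final String className;
        final Set<String> supportedTypes;

        ParserInfo(String className, Set<String> supportedTypes) {
            this.className = className;
            this.supportedTypes = supportedTypes;
        }
    }

    /**
     * Builds a mime type -> parser class name table.
     * discovered: parsers found on the classpath, in discovery order.
     * overrides:  mime type -> preferred parser class name (the proposed config).
     */
    static Map<String, String> resolve(List<ParserInfo> discovered,
                                       Map<String, String> overrides) {
        Map<String, ParserInfo> byName = new LinkedHashMap<>();
        Map<String, String> table = new LinkedHashMap<>();

        // Step 1: auto-wiring. Assumption of this sketch: a later parser
        // replaces an earlier one for the same mime type.
        for (ParserInfo parser : discovered) {
            byName.put(parser.className, parser);
            for (String type : parser.supportedTypes) {
                table.put(type, parser.className);
            }
        }

        // Step 2: overrides. Only honoured when the class was discovered and
        // claims the mime type; everything else keeps its auto-wired parser.
        for (Map.Entry<String, String> override : overrides.entrySet()) {
            ParserInfo wanted = byName.get(override.getValue());
            if (wanted != null && wanted.supportedTypes.contains(override.getKey())) {
                table.put(override.getKey(), wanted.className);
            }
        }
        return table;
    }

    public static void main(String[] args) {
        List<ParserInfo> discovered = List.of(
            new ParserInfo("org.example.DefaultPdfParser", Set.of("application/pdf")),
            new ParserInfo("com.vendor.PluginParser",
                           Set.of("application/pdf", "application/x-vendor")));

        // The plugin "hi-jacked" application/pdf; this single entry takes it back.
        Map<String, String> overrides =
            Map.of("application/pdf", "org.example.DefaultPdfParser");

        resolve(discovered, overrides)
            .forEach((type, parser) -> System.out.println(type + " -> " + parser));
    }
}
```

The point of the two-step design is that the config only needs to list the 
mime types you disagree with, so newly dropped-in parser jars keep working 
without touching the file.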

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
