On Wednesday, 08.04.2009, at 15:58 +0200, Jukka Zitting wrote:
> Hi,
> 
> Revisiting a topic that we've considered already before (in at least
> [1], [2] and [3])...
> 
> I'm working on integrating Tika to Jackrabbit [4], and there we found
> it desirable [5] to make it easier to depend on just the core Tika
> classes without all the parser dependencies.
> 
> To make this happen, I'd split Tika into following component libraries:
> 
> * tika-core - core parts of Tika; everything but cli, gui, and the
> parser.* packages
> * tika-parsers - format-specific parser classes; with dependencies to
> external libraries
> * tika-app - depends on all of the above; adds cli and gui; standalone
> jar packaging
> 
> We could (should?) further split the tika-parsers component into
> smaller pieces based on the external dependencies used to allow
> finer-grained control over what parser libraries get included in a
> specific downstream package or deployment.
> 
> WDYT? If there are no objections, I'd like to target this for the Tika
> 0.4 release.
> 
> [1] http://markmail.org/message/n64zb3cawlm4ng3k
> [2] http://markmail.org/message/ji3xabugnt6wlwdh
> [3] http://markmail.org/message/2sd6d5ajhpqhcwcf
> [4] https://issues.apache.org/jira/browse/JCR-1878
> [5] http://markmail.org/message/cf6bj7qv7fyyxezu
> 
> BR,
> 
> Jukka Zitting
> 

+1

In my use case, it would be ideal to add custom parsers to the auto
detection "on the fly".

I have a few ideas for how to implement that:
a) Make TikaConfig more flexible by adding setters for the parsers,

e.g.
TikaConfig conf = TikaConfig.getDefaultConfig();
// CHANGED: proposed setter, not part of the current TikaConfig API
conf.setParser("application/custom", new MyCustomParserClass());
// end of change
AutoDetectParser parser = new AutoDetectParser(conf);

This is trivial and almost non-intrusive, but it leaves the registration
work on the client side.

b) extend the Parser interface to let the parsers themselves report
their capabilities (something like MyParser.getSupportedTypes()) and add
some class loading magic, e.g. specify a plugin directory in the config
file and load every class in there. 
My gut tells me this could turn into a hack if not done right, but I like
the administrative side of it.

c) Leave it to a professional framework such as OSGi (Apache Felix).
This allows elegant runtime changes. It adds, however, another (small)
dependency and requires changes to the project structure.
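
To make options (a) and (b) above a bit more concrete, here is a minimal,
self-contained sketch. It deliberately does not use the real Tika classes:
SketchParser, SketchParserRegistry, MyCustomParser and RegistryDemo are all
hypothetical names invented for this illustration, and the parse method is
reduced to a string transformation. How the parser instances would be
discovered for option (b), e.g. from a plugin directory, is left out.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a parser interface (not the real Tika Parser).
interface SketchParser {
    // Option (b): the parser reports the MIME types it can handle.
    Set<String> getSupportedTypes();

    // Simplified parse step; a real parser would work on streams.
    String parse(String input);
}

// Hypothetical mutable registry playing the role of a more flexible config.
class SketchParserRegistry {
    private final Map<String, SketchParser> parsers = new ConcurrentHashMap<>();

    // Option (a): the client maps a MIME type to a parser explicitly.
    void setParser(String mimeType, SketchParser parser) {
        parsers.put(mimeType, parser);
    }

    // Option (b): the registry asks the parser what it supports.
    void register(SketchParser parser) {
        for (String type : parser.getSupportedTypes()) {
            setParser(type, parser);
        }
    }

    // Returns null if no parser is registered for the given type.
    SketchParser getParser(String mimeType) {
        return parsers.get(mimeType);
    }
}

// Example custom parser used by both variants below.
class MyCustomParser implements SketchParser {
    public Set<String> getSupportedTypes() {
        return new HashSet<>(Arrays.asList("application/custom", "application/x-custom"));
    }

    public String parse(String input) {
        return "custom:" + input;
    }
}

class RegistryDemo {
    // Option (a): explicit registration on the client side.
    static String viaSetter() {
        SketchParserRegistry registry = new SketchParserRegistry();
        registry.setParser("application/custom", new MyCustomParser());
        return registry.getParser("application/custom").parse("data");
    }

    // Option (b): registration driven by the parser's own capabilities.
    static String viaSelfDescription() {
        SketchParserRegistry registry = new SketchParserRegistry();
        registry.register(new MyCustomParser());
        return registry.getParser("application/x-custom").parse("data");
    }
}
```

In this sketch option (b) is just a thin layer over option (a), so the
setter could be added first and the self-describing registration later.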


What do you think? I would implement one of these (or something similar)
if there is interest and no conflict with other plans.

René Wiermer
