On Wednesday, 08.04.2009, at 15:58 +0200, Jukka Zitting wrote:
> Hi,
>
> Revisiting a topic that we've considered already before (in at least
> [1], [2] and [3])...
>
> I'm working on integrating Tika to Jackrabbit [4], and there we found
> it desirable [5] to make it easier to depend on just the core Tika
> classes without all the parser dependencies.
>
> To make this happen, I'd split Tika into following component libraries:
>
> * tika-core - core parts of Tika; everything but cli, gui, and the
>   parser.* packages
> * tika-parsers - format-specific parser classes; with dependencies to
>   external libraries
> * tika-app - depends on all of the above; adds cli and gui; standalone
>   jar packaging
>
> We could (should?) further split the tika-parsers component into
> smaller pieces based on the external dependencies used to allow
> finer-grained control over what parser libraries get included in a
> specific downstream package or deployment.
>
> WDYT? If there are no objections, I'd like to target this for the Tika
> 0.4 release.
>
> [1] http://markmail.org/message/n64zb3cawlm4ng3k
> [2] http://markmail.org/message/ji3xabugnt6wlwdh
> [3] http://markmail.org/message/2sd6d5ajhpqhcwcf
> [4] https://issues.apache.org/jira/browse/JCR-1878
> [5] http://markmail.org/message/cf6bj7qv7fyyxezu
>
> BR,
>
> Jukka Zitting
>

+1

In my use case, it would be ideal to add custom parsers to the auto
detection "on the fly". I have a few ideas for how to implement that:

a) Make TikaConfig more flexible by adding setters for the parsers, e.g.

       TikaConfig conf = TikaConfig.getDefaultConfig();
       // CHANGED
       conf.setParser("application/custom", MyCustomParserClass);
       //
       AutoDetectParser parser = new AutoDetectParser(conf);

   This is trivial and almost non-intrusive, but leaves the work on the
   client side.

b) Extend the Parser interface to let the parsers themselves report
   their capabilities (something like MyParser.getSupportedTypes()) and
   add some class loading magic, e.g. specify a plugin directory in the
   config file and load every class in there. My gut tells me this
   could be a hack if not done right, but I like the administrative
   view of this.

c) Let a professional do it, like OSGi (Apache Felix). This allows
   elegant runtime changes. It adds, however, another (small)
   dependency and requires changes to the structure.

What do you think? I would implement one of these (or something
similar) if there is interest and no conflict with other plans.

René Wiermer
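
For illustration only, option b) could look roughly like the sketch below. It does not use the real Tika classes: TypedParser, MyCustomParser and ParserRegistry are hypothetical stand-ins, and the plugin-directory class loading is left out. It only shows a registry building its type-to-parser map from what each parser reports via getSupportedTypes().

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class ParserRegistrySketch {

    /** Hypothetical stand-in for a Parser that reports its own capabilities. */
    interface TypedParser {
        Set<String> getSupportedTypes();
    }

    /** Hypothetical custom parser handling a made-up media type. */
    static class MyCustomParser implements TypedParser {
        @Override
        public Set<String> getSupportedTypes() {
            return Collections.singleton("application/custom");
        }
    }

    /** Maps media types to parsers, based on what each parser reports. */
    static class ParserRegistry {
        private final Map<String, TypedParser> parsers = new HashMap<>();

        /** Registers the parser for every media type it claims to support. */
        void register(TypedParser parser) {
            for (String type : parser.getSupportedTypes()) {
                parsers.put(type, parser);
            }
        }

        /** Returns the parser registered for the type, or null if there is none. */
        TypedParser lookup(String mediaType) {
            return parsers.get(mediaType);
        }
    }

    public static void main(String[] args) {
        ParserRegistry registry = new ParserRegistry();
        registry.register(new MyCustomParser());
        TypedParser found = registry.lookup("application/custom");
        System.out.println(found != null ? "parser found" : "no parser");
    }
}
```

With this shape the client no longer has to know the media type when registering; a plugin loader would only need to instantiate each class and call register().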