Splitting into three major components will certainly help reusability, but too many small components may make Tika less convenient to use because of the large number of jars.
A different question: does Tika plan to provide a function for scraping web pages? The Tika HTML parser returns everything on an HTML page. For some applications, such as search, it's necessary to exclude sections such as advertising, menus, footers, etc. It would be extremely useful to have a scraping capability in Tika. Has anybody developed web page scraping code on top of Tika?

Thanks,
aj

On Wed, Apr 8, 2009 at 6:58 AM, Jukka Zitting <jukka.zitt...@gmail.com> wrote:

> Hi,
>
> Revisiting a topic that we've considered already before (in at least
> [1], [2] and [3])...
>
> I'm working on integrating Tika to Jackrabbit [4], and there we found
> it desirable [5] to make it easier to depend on just the core Tika
> classes without all the parser dependencies.
>
> To make this happen, I'd split Tika into following component libraries:
>
> * tika-core - core parts of Tika; everything but cli, gui, and the
> parser.* packages
> * tika-parsers - format-specific parser classes; with dependencies to
> external libraries
> * tika-app - depends on all of the above; adds cli and gui; standalone
> jar packaging
>
> We could (should?) further split the tika-parsers component into
> smaller pieces based on the external dependencies used to allow
> finer-grained control over what parser libraries get included in a
> specific downstream package or deployment.
>
> WDYT? If there are no objections, I'd like to target this for the Tika
> 0.4 release.
>
> [1] http://markmail.org/message/n64zb3cawlm4ng3k
> [2] http://markmail.org/message/ji3xabugnt6wlwdh
> [3] http://markmail.org/message/2sd6d5ajhpqhcwcf
> [4] https://issues.apache.org/jira/browse/JCR-1878
> [5] http://markmail.org/message/cf6bj7qv7fyyxezu
>
> BR,
>
> Jukka Zitting
>

--
AJ Chen, PhD
Co-Chair, Semantic Web SIG, sdforum.org
Technical Architect, healthline.com
http://web2express.org
Palo Alto, CA