Re: Splitting Tika to separate modules

AJ Chen Wed, 08 Apr 2009 11:11:55 -0700

Splitting into three major components will certainly help reusability, but
too many small components may make Tika less convenient to use because of the
large number of jars.


A different question: does Tika plan to provide a function for scraping web
pages? The Tika HTML parser returns everything on an HTML page. Some
applications, such as search, need to exclude sections like advertising,
menus, and footers, so it would be extremely useful to have scraping
capability in Tika. Has anybody developed web page scraping code on
top of Tika?
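
To make the idea concrete, here is a rough sketch of the kind of filtering I
have in mind. It is not Tika code: it uses only the JDK's SAX parser on
well-formed XHTML. Since Tika parsers report content as XHTML SAX events to a
ContentHandler, a similar handler could presumably be handed to a Tika parser,
but I have not tried that. The class name BoilerplateSkippingHandler, the
block list, and the rule of matching on the class or id attribute are all
assumptions made up for illustration.

```java
import java.io.StringReader;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/**
 * Collects character data from an XHTML document, skipping every element
 * (and its descendants) whose class or id attribute is on a block list.
 * Hypothetical sketch; not part of Tika.
 */
public class BoilerplateSkippingHandler extends DefaultHandler {

    // Assumed block list; real pages would need site-specific rules.
    private static final Set<String> SKIP = Set.of("advertising", "menu", "footer");

    private final StringBuilder text = new StringBuilder();

    // Greater than zero while the parser is inside a skipped element.
    private int skipDepth = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Once skipping has started, every nested element deepens the skip.
        if (skipDepth > 0 || isSkipped(atts)) {
            skipDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (skipDepth > 0) {
            skipDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Keep text only when outside all skipped elements.
        if (skipDepth == 0) {
            text.append(ch, start, length);
        }
    }

    private static boolean isSkipped(Attributes atts) {
        String cls = atts.getValue("class");
        String id = atts.getValue("id");
        return (cls != null && SKIP.contains(cls)) || (id != null && SKIP.contains(id));
    }

    /** Returns the collected text with runs of whitespace collapsed. */
    public String getText() {
        return text.toString().replaceAll("\\s+", " ").trim();
    }

    /** Parses a well-formed XHTML string and returns the text that was kept. */
    public static String extract(String xhtml) throws Exception {
        BoilerplateSkippingHandler handler = new BoilerplateSkippingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getText();
    }
}
```

Matching on a fixed list of class and id values is obviously too crude for
real sites; the point is only that the filtering can live in the handler, on
top of whatever produces the SAX events.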

thanks,
aj

On Wed, Apr 8, 2009 at 6:58 AM, Jukka Zitting <jukka.zitt...@gmail.com> wrote:

> Hi,
>
> Revisiting a topic that we've considered already before (in at least
> [1], [2] and [3])...
>
> I'm working on integrating Tika to Jackrabbit [4], and there we found
> it desirable [5] to make it easier to depend on just the core Tika
> classes without all the parser dependencies.
>
> To make this happen, I'd split Tika into the following component libraries:
>
> * tika-core - core parts of Tika; everything but cli, gui, and the
> parser.* packages
> * tika-parsers - format-specific parser classes; with dependencies to
> external libraries
> * tika-app - depends on all of the above; adds cli and gui; standalone
> jar packaging
>
> We could (should?) further split the tika-parsers component into
> smaller pieces based on the external dependencies used to allow
> finer-grained control over what parser libraries get included in a
> specific downstream package or deployment.
>
> WDYT? If there are no objections, I'd like to target this for the Tika
> 0.4 release.
>
> [1] http://markmail.org/message/n64zb3cawlm4ng3k
> [2] http://markmail.org/message/ji3xabugnt6wlwdh
> [3] http://markmail.org/message/2sd6d5ajhpqhcwcf
> [4] https://issues.apache.org/jira/browse/JCR-1878
> [5] http://markmail.org/message/cf6bj7qv7fyyxezu
>
> BR,
>
> Jukka Zitting
>



-- 
AJ Chen, PhD
Co-Chair, Semantic Web SIG, sdforum.org
Technical Architect, healthline.com
http://web2express.org
Palo Alto, CA
