Re: Splitting Tika to separate modules

Jonathan Koren Wed, 08 Apr 2009 11:57:42 -0700


On Apr 8, 2009, at 11:10 AM, AJ Chen wrote:

A different question: does tika plan to provide function for scraping web
page? tika html parser provides everything on html page. for some
applications such as search, it's required to exclude sections including
advertising, menu, footer, etc.  it would be extremely useful to have
scraping capability in tika. Has anybody developed web page scraping code on
top of tika?

Well, a webpage is already parsable HTML, so I don't know exactly why Tika would be the relevant thing to use here. Excluding certain sections of a page is an application-specific task. To turn your example on its head, perhaps you want to read only the advertisements for some sort of business/marketing reason.


--
Jonathan Koren
[email protected]
http://www.soe.ucsc.edu/~jonathan/
