On Apr 8, 2009, at 11:10 AM, AJ Chen wrote:

A different question: does Tika plan to provide a function for scraping web
pages? The Tika HTML parser returns everything on an HTML page. For some
applications, such as search, it's necessary to exclude sections such as
advertising, menus, footers, etc. It would be extremely useful to have
scraping capability in Tika. Has anybody developed web page scraping code on
top of Tika?


Well, a web page is already parsable HTML, so I don't see why Tika would be the relevant tool here. Excluding certain sections of a page is an application-specific task. To turn your example on its head, perhaps you want to read only the advertisements for some business or marketing reason.
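
To make "application-specific" concrete, here is a rough sketch of the kind of filter an application might write for itself. Tika hands parsed content to a SAX ContentHandler as XHTML events, so the filter is written as a plain SAX handler; to stay self-contained it is driven here by the JDK's own SAX parser on a well-formed XHTML string. The class name SectionFilter, the extract helper, and the "menu" and "ad" class names are all hypothetical, and whether a given HTML parser preserves the class attributes this relies on is an assumption you would need to check.

```java
import java.io.StringReader;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Collects text while skipping elements whose class attribute is on a blocklist. */
class SectionFilter extends DefaultHandler {
    private final Set<String> skipClasses;  // hypothetical, chosen by the application
    private final StringBuilder text = new StringBuilder();
    private int skipDepth = 0;              // > 0 while inside a skipped subtree

    SectionFilter(Set<String> skipClasses) {
        this.skipClasses = skipClasses;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Exact match on the whole class attribute; multi-class values are not split.
        String cls = atts.getValue("class");
        if (skipDepth > 0 || (cls != null && skipClasses.contains(cls))) {
            skipDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (skipDepth > 0) {
            skipDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Keep text only when we are outside every skipped subtree.
        if (skipDepth == 0) {
            text.append(ch, start, length);
        }
    }

    /** Returns the kept text with runs of whitespace collapsed to single spaces. */
    String text() {
        return text.toString().trim().replaceAll("\\s+", " ");
    }

    /** Parses a well-formed XHTML string and returns the text outside skipped sections. */
    static String extract(String xhtml, Set<String> skipClasses) throws Exception {
        SectionFilter handler = new SectionFilter(skipClasses);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.text();
    }
}
```

Calling SectionFilter.extract(page, Set.of("menu", "ad")) keeps only the text outside those sections. Which sections count as noise lives entirely in the blocklist, which is the application's decision; an application that wanted only the advertisements would invert the test in characters().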

--
Jonathan Koren
jonat...@soe.ucsc.edu
http://www.soe.ucsc.edu/~jonathan/

