On Apr 8, 2009, at 11:10 AM, AJ Chen wrote:
A different question: does Tika plan to provide a function for
scraping web pages? The Tika HTML parser returns everything on an
HTML page. For some applications, such as search, it's necessary to
exclude sections such as advertising, menus, footers, etc. It would
be extremely useful to have scraping capability in Tika. Has anybody
developed web page scraping code on top of Tika?

Well, a webpage is already parsable HTML, so I don't know exactly why
Tika would be the relevant thing to use here. Excluding certain
sections of a page is an application-specific task. To turn your
example on its head, perhaps you want to read only the advertisements
for some sort of business/marketing reason.
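
To make the "application-specific" point concrete, here is a minimal sketch of a SAX handler that either drops or keeps only the sections an application names. Everything in it is an assumption for illustration: the class name SectionFilterSketch, the idea that sections are marked by id or class attributes, and the well-formed XHTML input parsed with the JDK's own SAX parser. It is not part of Tika. Tika parsers do emit SAX events to a ContentHandler, so a handler along these lines could presumably be supplied to one, but whether id and class attributes survive Tika's HTML handling is not verified here.

```java
import java.io.StringReader;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not a Tika class: collects text from a page while
// either skipping or keeping only the sections the application lists.
class SectionFilterSketch extends DefaultHandler {
    private final Set<String> sectionNames; // assumption: sections are marked by id/class
    private final boolean keepListed;       // false = drop listed sections, true = keep only them
    private final StringBuilder text = new StringBuilder();
    private int depth = 0;                  // > 0 while inside a listed section

    SectionFilterSketch(Set<String> sectionNames, boolean keepListed) {
        this.sectionNames = sectionNames;
        this.keepListed = keepListed;
    }

    // True if any whitespace-separated token of the attribute value is listed.
    private boolean matches(String value) {
        if (value == null) return false;
        for (String token : value.trim().split("\\s+")) {
            if (sectionNames.contains(token)) return true;
        }
        return false;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (depth > 0) {
            depth++; // nested element inside a listed section
        } else if (matches(atts.getValue("id")) || matches(atts.getValue("class"))) {
            depth = 1; // entering a listed section
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (depth > 0) depth--;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        boolean insideListed = depth > 0;
        if (insideListed == keepListed) text.append(ch, start, length);
    }

    // Parses well-formed XHTML and returns the selected text, whitespace-normalized.
    static String extract(String xhtml, Set<String> sectionNames, boolean keepListed)
            throws Exception {
        SectionFilterSketch handler = new SectionFilterSketch(sectionNames, keepListed);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.text.toString().trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><div id=\"menu\"><a href=\"/\">Home</a></div>"
                + "<p>Article text.</p><div class=\"ad\">Buy now</div>"
                + "<div id=\"footer\">Copyright</div></body></html>";
        // Search-style use: drop menu, ads, and footer.
        System.out.println(extract(page, Set.of("menu", "ad", "footer"), false));
        // The inverted use: read only the advertisements.
        System.out.println(extract(page, Set.of("ad"), true));
    }
}
```

The same handler serves both use cases by flipping one flag, which is why the choice of what to exclude belongs in the application rather than in the parser.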
--
Jonathan Koren
jonat...@soe.ucsc.edu
http://www.soe.ucsc.edu/~jonathan/