Hi,

On Sun, Dec 21, 2008 at 9:40 PM, Manuel Fernández Sánchez de la Blanca
<manolowa...@gmail.com> wrote:
> I'm building a web crawler and I'd like to know the type of content that can
> be extracted from an HTML document.

Tika uses some heuristics to normalize the HTML content and filter out
things like script and style elements that do not contain text content
visible to the end user. Otherwise the extracted content is pretty much
equivalent to the input HTML.

> For example, it could be possible to get the list of anchors (<a> tags)?

Yes, you can get that information by listening to the XHTML SAX events
produced by Tika.

BR,

Jukka Zitting
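
A minimal sketch of the SAX-listening approach described in the reply above. It is not taken from the original message. The class name AnchorCollector and the helper collect() are hypothetical, and the JDK's own SAX parser stands in for Tika here so the example runs without extra dependencies. It assumes Tika reports anchors as "a" elements in the XHTML namespace with an unqualified "href" attribute; with Tika itself, the same handler would be passed as the ContentHandler argument of the parser's parse method, whose exact signature depends on the Tika version.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records the href of every XHTML <a> element it sees.
class AnchorCollector extends DefaultHandler {
    private static final String XHTML_NS = "http://www.w3.org/1999/xhtml";
    private final List<String> hrefs = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Assumption: anchors arrive as namespace-qualified "a" elements.
        if (XHTML_NS.equals(uri) && "a".equals(localName)) {
            String href = atts.getValue("", "href");
            if (href != null) {
                hrefs.add(href); // anchors without href (e.g. named anchors) are skipped
            }
        }
    }

    List<String> getHrefs() {
        return hrefs;
    }

    // Stand-in for a Tika parse: feeds well-formed XHTML through the JDK SAX parser.
    static List<String> collect(String xhtml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed so uri/localName are populated
        AnchorCollector handler = new AnchorCollector();
        factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getHrefs();
    }

    public static void main(String[] args) throws Exception {
        String sample = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<a href=\"http://example.org/\">Example</a>"
                + "<p>See <a href=\"/docs\">the docs</a>.</p>"
                + "</body></html>";
        for (String href : collect(sample)) {
            System.out.println(href);
        }
    }
}
```

Collecting only the href keeps the sketch short; a crawler would probably also buffer the characters() events between the start and end of each anchor to capture the link text.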