Hi,

On Sun, Dec 21, 2008 at 9:40 PM, Manuel Fernández Sánchez de la Blanca
<manolowa...@gmail.com> wrote:
> I'm building a web crawler and I'd like to know the type of content that can
> be extracted from an HTML document.

Tika uses some heuristics to normalize the HTML content and filter out
things like script and style elements that do not contain text content
visible to the end user. Otherwise the extracted content is pretty
much equivalent to the input HTML.

> For example, would it be possible to get the list of anchors (<a> tags)?

Yes, you can get that information by listening to the XHTML SAX events
produced by Tika.
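
As a rough illustration, here is a minimal sketch of such a listener. It
is a plain SAX handler that records the href attribute of every <a>
start element. The class name AnchorCollector and the collect() helper
are made up for this example and are not part of Tika. To keep the
sketch runnable without Tika on the classpath, it is driven here by the
JDK's own SAX parser on a well-formed XHTML string.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not a Tika class): collects href values of <a> elements.
class AnchorCollector extends DefaultHandler {
    private final List<String> hrefs = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Namespace-aware sources fill in localName; otherwise fall back to qName.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        if ("a".equalsIgnoreCase(name)) {
            // Look the attribute up by (namespace, local name) first, then by qName.
            String href = atts.getValue("", "href");
            if (href == null) {
                href = atts.getValue("href");
            }
            if (href != null) {
                hrefs.add(href);
            }
        }
    }

    public List<String> getHrefs() {
        return hrefs;
    }

    // Stand-in driver for this example only: parses well-formed XHTML with the JDK parser.
    static List<String> collect(String xhtml) throws Exception {
        AnchorCollector handler = new AnchorCollector();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        byte[] bytes = xhtml.getBytes(StandardCharsets.UTF_8);
        factory.newSAXParser().parse(new ByteArrayInputStream(bytes), handler);
        return handler.getHrefs();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<a href=\"http://example.org/\">Example</a>"
                + "<a href=\"/about\">About</a>"
                + "</body></html>";
        System.out.println(collect(xhtml));
    }
}
```

With Tika, you would pass an instance of this handler to the HTML
parser's parse() method instead of the JDK parser, and read getHrefs()
once parsing finishes. The exact parse() signature depends on the Tika
version you use.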

BR,

Jukka Zitting