Hi,

On Thu, Jan 5, 2012 at 10:29 AM, Mukhit <[email protected]> wrote:
> I am trying to extract span tags from an HTML document, but without
> success. The Tika HTML parser extracts only tags like p, a, b, br, div.
> Any suggestions would be nice.

By default Tika attempts to normalize the incoming HTML document to
make it easier for client applications to consume. See the
org.apache.tika.parser.html.DefaultHtmlMapper class for the details.

You can either subclass DefaultHtmlMapper so that span tags are also
treated as safe to include in the XHTML output, or use the
org.apache.tika.parser.html.IdentityHtmlMapper class to disable all
normalization of the incoming HTML. Either solution allows span
tags to be passed through to your application.
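
As a rough sketch of the subclassing option: the idea is to override the
element-mapping method so that it answers "span" for span tags and defers
to the default behaviour for everything else. The code below is a
simplified model and does not use Tika's own classes. ElementMapper,
WhitelistMapper and SpanAwareMapper are hypothetical stand-ins, declared
locally so that the example compiles without Tika on the classpath. In a
real project you would extend DefaultHtmlMapper instead and override its
mapSafeElement(String) method in the same way. Please check the Javadoc of
your Tika version for the exact contract.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for the element-mapping part of an HTML mapper.
// It is NOT Tika's HtmlMapper; it only models the same idea.
interface ElementMapper {
    // Returns the normalized name of a safe element, or null to drop the tag.
    String mapSafeElement(String name);
}

// Hypothetical default mapper: only a short whitelist of tags survives.
class WhitelistMapper implements ElementMapper {
    private static final Map<String, String> SAFE = new HashMap<>();
    static {
        for (String tag : new String[] {"p", "a", "b", "br", "div"}) {
            SAFE.put(tag, tag);
        }
    }

    @Override
    public String mapSafeElement(String name) {
        // Look the tag up case-insensitively; unknown tags map to null.
        return SAFE.get(name.toLowerCase(Locale.ENGLISH));
    }
}

// The subclassing approach: accept span, delegate everything else.
class SpanAwareMapper extends WhitelistMapper {
    @Override
    public String mapSafeElement(String name) {
        if ("span".equalsIgnoreCase(name)) {
            return "span";
        }
        return super.mapSafeElement(name);
    }
}

class SpanMapperDemo {
    public static void main(String[] args) {
        ElementMapper base = new WhitelistMapper();
        ElementMapper withSpan = new SpanAwareMapper();
        System.out.println(base.mapSafeElement("span"));     // null (dropped)
        System.out.println(withSpan.mapSafeElement("SPAN")); // span
        System.out.println(withSpan.mapSafeElement("div"));  // div
    }
}
```

The comparison is case-insensitive because the parser may hand element
names to the mapper in upper case. Attributes such as class may be
filtered separately from element names, so a span that passes through
could still lose its attributes unless the mapper also allows them.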

To make Tika use such an alternative HtmlMapper instance, simply pass
it as a part of the parse context, like this:

    ParseContext context = new ParseContext();
    context.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE);

    Parser parser = ...;
    parser.parse(..., context);

BR,

Jukka Zitting
