Hi,

On Mon, Apr 8, 2013 at 9:32 PM, Jason Tesser <[email protected]> wrote:
> What is the right way to do this?

I see two options:

1) Use the IdentityHtmlMapper strategy to have Tika pass you all HTML
elements as-is. Then you can explicitly skip selected elements in the
SAX ContentHandler you pass in to the parser. Something like this:

    ParseContext context = new ParseContext();
    context.set(HtmlMapper.class, new IdentityHtmlMapper());
    parser.parse(..., context);

2) Extend the HtmlMapper interface so that the isDiscardElement()
method also receives the element's attributes. Then you can pass a
custom mapper class that implements the class="donotparse" strategy
you describe. This approach requires changes in Tika, so you might
want to consider submitting a patch with your (ideally
backwards-compatible) changes.
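
For option 1, the skipping handler could look roughly like the sketch
below. It uses only the JDK's SAX classes, so it runs without Tika. The
class name SkipMarkedElementsHandler and the visibleText() helper are
made up for illustration and are not part of Tika. The sketch assumes
that the class attribute reaches the handler under the qualified name
"class"; that holds for the JDK parser used in the helper, but I have
not checked it against the events Tika emits.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/**
 * Hypothetical decorator: forwards SAX events to a delegate handler,
 * but drops every element marked class="donotparse" together with
 * everything nested inside it.
 */
class SkipMarkedElementsHandler extends DefaultHandler {
    private final ContentHandler delegate;
    private int skipDepth = 0; // greater than 0 while inside a skipped subtree

    SkipMarkedElementsHandler(ContentHandler delegate) {
        this.delegate = delegate;
    }

    /** True if the class attribute contains the token "donotparse". */
    private static boolean isMarked(Attributes atts) {
        String cls = atts.getValue("class"); // assumption: qName is "class"
        if (cls == null) {
            return false;
        }
        for (String token : cls.trim().split("\\s+")) {
            if (token.equals("donotparse")) {
                return true;
            }
        }
        return false;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        if (skipDepth > 0 || isMarked(atts)) {
            skipDepth++; // count nesting so the matching end tag is found
            return;
        }
        delegate.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        if (skipDepth > 0) {
            skipDepth--;
            return;
        }
        delegate.endElement(uri, localName, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        if (skipDepth == 0) {
            delegate.characters(ch, start, length);
        }
    }

    /** Demo helper: parses well-formed XHTML and returns the kept text. */
    static String visibleText(String xhtml) throws Exception {
        final StringBuilder text = new StringBuilder();
        DefaultHandler collector = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader(xhtml)),
                new SkipMarkedElementsHandler(collector));
        return text.toString();
    }
}
```

With Tika you would wrap your real ContentHandler in such a decorator
and pass the wrapper to parser.parse(...) together with the context
shown in option 1.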

BR,

Jukka Zitting
