Hi,
On Mon, Apr 8, 2013 at 9:32 PM, Jason Tesser wrote:
> What is the right way to do this?
I see two options:
1) Use the IdentityHtmlMapper strategy to have Tika pass you all HTML
elements as-is. Then you can explicitly skip selected elements in the
SAX ContentHandler you pass in to the parser. Something like this:
    ParseContext context = new ParseContext();
    context.set(HtmlMapper.class, new IdentityHtmlMapper());
    parser.parse(..., context);
2) Extend the HtmlMapper interface so that the isDiscardElement()
method also receives the element's attributes. Then you can pass in a
custom mapper class that implements the class="donotparse" strategy you
describe. This approach requires changes to Tika itself, so you might
want to consider submitting your (ideally backwards-compatible) changes
as a patch.
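For option 1, the skipping logic in the handler could look roughly like
the sketch below. It is only an illustration: the names
SkipMarkedElements, SkippingHandler and extractText are made up, the
class attribute is matched exactly (so class="foo donotparse" would not
be skipped), and the handler is driven by the JDK SAX parser on a
well-formed snippet so that it runs without Tika. With Tika you would
instead pass such a handler to parser.parse(...) together with the
context shown above, assuming the class attribute reaches the handler
under the name "class".

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class, not part of Tika.
class SkipMarkedElements {

    /** Collects character data, ignoring everything inside class="donotparse". */
    static class SkippingHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        // Nesting depth inside a skipped element; 0 means nothing is being skipped.
        private int skipDepth = 0;

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes atts) {
            // Start skipping at a marked element, and keep counting nested
            // elements so we know when the marked element has been closed.
            if (skipDepth > 0 || "donotparse".equals(atts.getValue("class"))) {
                skipDepth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (skipDepth > 0) {
                skipDepth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (skipDepth == 0) {
                text.append(ch, start, length);
            }
        }

        String getText() {
            return text.toString();
        }
    }

    /** Parses a well-formed XML/XHTML string and returns the text that was not skipped. */
    static String extractText(String xhtml) {
        SkippingHandler handler = new SkippingHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
        return handler.getText();
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><p>keep</p>"
                + "<div class=\"donotparse\"><p>drop</p></div>"
                + "<p>this too</p></body></html>";
        System.out.println(extractText(xhtml));
    }
}
```

The depth counter is there so that elements nested inside a marked
element are skipped too, and so that skipping ends only when the marked
element itself is closed.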
BR,
Jukka Zitting