Hi,

On Tue, Apr 9, 2013 at 3:19 PM, Jason Tesser <[email protected]> wrote:
> I would rather not alter Tika code. So that brings me to option 1. I don't
> really understand how IdentityHtmlMapper helps.
> http://tika.apache.org/1.2/api/org/apache/tika/parser/html/IdentityHtmlMapper.html
> The class doesn't seem to expose methods to do anything different than the
> DefaultHtmlMapper.
>
> Can you give me just a little more detail here?

The IdentityHtmlMapper makes Tika pass the parsed HTML as-is to the specified
SAX ContentHandler, so you'll also get the <div class="donotparse"> events
that would otherwise be swallowed by the DefaultHtmlMapper strategy. You can
write a custom ContentHandler class that detects elements carrying the
"donotparse" class attribute and skips all content within such elements.

BR,

Jukka Zitting
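
For illustration, here is a minimal sketch of such a skipping handler. It uses only the JDK's SAX classes; the class name SkipMarkedElementsHandler and the depth-counter approach are assumptions for this example, not something Tika ships. The Tika wiring in the trailing comment is likewise an untested sketch against the Tika 1.x API and may need adjusting for your version.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

/**
 * Hypothetical helper: forwards SAX events to a delegate handler, but drops
 * everything inside an element whose class attribute contains "donotparse".
 */
class SkipMarkedElementsHandler extends XMLFilterImpl {
    private static final String MARKER = "donotparse";

    // Greater than zero while we are inside a marked element; counts nesting
    // so that the matching end tag of the marked element is found reliably.
    private int skipDepth = 0;

    SkipMarkedElementsHandler(ContentHandler delegate) {
        setContentHandler(delegate);
    }

    private static boolean isMarked(Attributes atts) {
        // Try the namespace-aware lookup first, then fall back to the qName.
        String cls = atts.getValue("", "class");
        if (cls == null) {
            cls = atts.getValue("class");
        }
        if (cls == null) {
            return false;
        }
        for (String token : cls.trim().split("\\s+")) {
            if (MARKER.equals(token)) {
                return true;
            }
        }
        return false;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        if (skipDepth > 0 || isMarked(atts)) {
            skipDepth++;               // entering (or nesting inside) a skipped region
        } else {
            super.startElement(uri, localName, qName, atts);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        if (skipDepth > 0) {
            skipDepth--;               // leaving one level of the skipped region
        } else {
            super.endElement(uri, localName, qName);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (skipDepth == 0) {
            super.characters(ch, start, length);
        }
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length)
            throws SAXException {
        if (skipDepth == 0) {
            super.ignorableWhitespace(ch, start, length);
        }
    }
}

// Assumed wiring with Tika 1.x (sketch only, not compiled here):
//   ParseContext context = new ParseContext();
//   context.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE);
//   ContentHandler handler = new SkipMarkedElementsHandler(new BodyContentHandler());
//   new HtmlParser().parse(stream, handler, new Metadata(), context);
```

The handler wraps whatever ContentHandler you would normally pass to the parser, so the rest of your code should not need to change.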
