Hi,

On Tue, Apr 9, 2013 at 3:19 PM, Jason Tesser <[email protected]> wrote:
> I would rather not alter Tika code. So that brings me to option 1.  I don't
> really understand how IdentityHtmlMapper helps.
> http://tika.apache.org/1.2/api/org/apache/tika/parser/html/IdentityHtmlMapper.html
> The class doesn't seem to expose methods to do anything different than the
> DefaultHtmlMapper.
>
> Can you give me just a little more detail here?

The IdentityHtmlMapper makes Tika pass the parsed HTML as-is to the
specified SAX ContentHandler, so you'll also get the <div
class="donotparse"> events that the DefaultHtmlMapper would otherwise
swallow. You can then write a custom ContentHandler class that detects
the "donotparse" class attribute and skips all content within such
elements.
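
Here is a rough sketch of what such a handler could look like. It is
only an illustration, not tested against Tika: the class name
SkipDoNotParseHandler and the extract() helper are made up for this
example, and to keep it self-contained the demo feeds well-formed XHTML
through the JDK's own SAX parser instead of Tika's HtmlParser.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects character data, but ignores everything
// inside elements whose class attribute contains the token "donotparse".
class SkipDoNotParseHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    // 0 = not skipping; otherwise the nesting depth inside a skipped element.
    private int skipDepth = 0;

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        if (skipDepth > 0) {
            skipDepth++;              // nested element inside a skipped region
        } else if (hasDoNotParseClass(atts)) {
            skipDepth = 1;            // entering a skipped region
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (skipDepth > 0) {
            skipDepth--;              // leaving one level of the skipped region
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (skipDepth == 0) {
            text.append(ch, start, length);
        }
    }

    private static boolean hasDoNotParseClass(Attributes atts) {
        // Try the qualified name first, then the no-namespace local name.
        String cls = atts.getValue("class");
        if (cls == null) {
            cls = atts.getValue("", "class");
        }
        if (cls == null) {
            return false;
        }
        for (String token : cls.trim().split("\\s+")) {
            if (token.equals("donotparse")) {
                return true;
            }
        }
        return false;
    }

    String getText() {
        return text.toString();
    }

    // Demo helper (an assumption of this sketch): parse well-formed XHTML
    // with the JDK SAX parser and return the text that was not skipped.
    static String extract(String xhtml) {
        SkipDoNotParseHandler handler = new SkipDoNotParseHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
        return handler.getText();
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><p>keep</p>"
                + "<div class=\"donotparse\"><p>drop</p></div>"
                + "<p>also</p></body></html>";
        System.out.println(extract(xhtml));
    }
}
```

With Tika you would presumably register the mapper on the ParseContext,
along the lines of context.set(HtmlMapper.class,
IdentityHtmlMapper.INSTANCE), and pass an instance of the handler to the
parser's parse() call in place of the JDK parser used above. The depth
counter is what makes nested elements inside a "donotparse" block stay
skipped until the matching end tag arrives.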

BR,

Jukka Zitting
