OK, I see. Two follow-up questions, then:
1. I am concerned that I will get things I don't want. The HtmlMapper approach looks good, but currently my code just does this: http://pastebin.com/Hnucgc30

I do NOT set a mapper. I assume the HTML mapper gets picked up because I do set the MIME type:

Parser parser = (mimeType == null) ? getParser(binFile) : getParser(binFile, mimeType);

Will using the IdentityHtmlMapper alter much of what is currently happening?

2. I use the BodyContentHandler. I looked at its source code. It isn't doing much itself; it lets the ContentHandlerDecorator do most of the work, which just calls methods on the wrapped handler. I assume what I want is the startElement method or something similar. Can you confirm that for me?

Thank you for your help. This is for a product that people are already using, and I want to ensure that we don't break things that currently work.

On Tue, Apr 9, 2013 at 8:25 AM, Jukka Zitting <[email protected]> wrote:
> Hi,
>
> On Tue, Apr 9, 2013 at 3:19 PM, Jason Tesser <[email protected]> wrote:
> > I would rather not alter Tika code. So that brings me to option 1. I don't
> > really understand how IdentityHtmlMapper helps.
> >
> > http://tika.apache.org/1.2/api/org/apache/tika/parser/html/IdentityHtmlMapper.html
> >
> > The class doesn't seem to expose methods to do anything different than the
> > DefaultHtmlMapper.
> >
> > Can you give me just a little more detail here?
>
> The IdentityHtmlMapper makes Tika pass the parsed HTML as-is to the
> specified SAX ContentHandler, so you'll get also the <div
> class="donotparse"> events that would otherwise be swallowed by the
> DefaultHtmlMapper strategy. You can write a custom ContentHandler
> class that detects the "donotparse" attributes and skips all content
> within such elements.
>
> BR,
>
> Jukka Zitting
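For reference, here is a rough sketch of the kind of custom ContentHandler described in the quoted reply: it forwards SAX events to a wrapped handler but drops any element whose class attribute contains "donotparse", together with everything nested inside it. The class name SkipMarkedElementsHandler and its constructor parameters are made up for illustration; it uses only the JDK's org.xml.sax classes, not Tika's own ContentHandlerDecorator, and it forwards only the events shown.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/**
 * Hypothetical helper (not part of Tika): passes SAX events through to a
 * delegate, except for elements carrying the marker class and their contents.
 */
class SkipMarkedElementsHandler extends DefaultHandler {
    private final ContentHandler delegate;
    private final String markerClass;
    // Number of currently open elements inside (and including) a marked element.
    private int skipDepth = 0;

    SkipMarkedElementsHandler(ContentHandler delegate, String markerClass) {
        this.delegate = delegate;
        this.markerClass = markerClass;
    }

    @Override
    public void startDocument() throws SAXException {
        delegate.startDocument();
    }

    @Override
    public void endDocument() throws SAXException {
        delegate.endDocument();
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        if (skipDepth > 0 || hasMarkerClass(atts)) {
            skipDepth++;          // entering a skipped subtree, or going deeper in one
            return;
        }
        delegate.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        if (skipDepth > 0) {
            skipDepth--;          // leaving one level of the skipped subtree
            return;
        }
        delegate.endElement(uri, localName, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (skipDepth == 0) {
            delegate.characters(ch, start, length);
        }
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length)
            throws SAXException {
        if (skipDepth == 0) {
            delegate.ignorableWhitespace(ch, start, length);
        }
    }

    // True if the class attribute contains the marker as a whole token.
    private boolean hasMarkerClass(Attributes atts) {
        String value = atts.getValue("class");
        if (value == null) {
            value = atts.getValue("", "class");
        }
        if (value == null) {
            return false;
        }
        for (String token : value.trim().split("\\s+")) {
            if (token.equals(markerClass)) {
                return true;
            }
        }
        return false;
    }
}
```

With Tika, the idea would be to wrap the existing BodyContentHandler in this class and pass the wrapper to parse(), while registering the identity mapper on the ParseContext (something like context.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE), as far as I recall from the Tika API). That wiring is an assumption and is not shown or tested here.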
