OK I see.

Two follow-up questions, then:

1. I am concerned that I will get things I don't want. The output from the
current HTML mapper looks good. Currently my code just does this:
http://pastebin.com/Hnucgc30 . I do NOT set a mapper. I assume the default
HTML mapper gets picked up because I do set the MIME type:

    Parser parser = (mimeType == null)
            ? getParser(binFile)
            : getParser(binFile, mimeType);

Will using the IdentityHtmlMapper alter much of what is currently happening?

2. I use the BodyContentHandler. I looked at its source code. It doesn't do
much itself; it lets the ContentHandlerDecorator do most of the work, and
that class just calls the corresponding methods on the wrapped handler.

I assume what I want to override is the startElement method or something
similar.

Can you confirm that for me?
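
To make sure I understand the suggestion, here is a rough sketch of the kind
of handler I think is meant. It uses only the JDK's SAX classes (XMLFilterImpl
as the decorator) so that it stands alone. SkipDoNotParseHandler and the depth
counter are my own hypothetical names, not anything in Tika, and I am assuming
the attribute arrives under the name "class".

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

/**
 * Hypothetical decorator: forwards SAX events to the wrapped handler,
 * except for elements carrying class="donotparse" and everything inside them.
 */
class SkipDoNotParseHandler extends XMLFilterImpl {

    private static final String SKIP_CLASS = "donotparse";

    // Greater than zero while we are inside a skipped element;
    // counts nesting so we know when the skipped element closes.
    private int skipDepth = 0;

    SkipDoNotParseHandler(ContentHandler delegate) {
        setContentHandler(delegate);
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        if (skipDepth > 0 || hasSkipClass(atts)) {
            skipDepth++; // swallow this element and track its depth
        } else {
            super.startElement(uri, localName, qName, atts);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        if (skipDepth > 0) {
            skipDepth--; // closing a swallowed element
        } else {
            super.endElement(uri, localName, qName);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        if (skipDepth == 0) {
            super.characters(ch, start, length);
        }
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length)
            throws SAXException {
        if (skipDepth == 0) {
            super.ignorableWhitespace(ch, start, length);
        }
    }

    // True if the class attribute contains the "donotparse" token.
    private static boolean hasSkipClass(Attributes atts) {
        String value = atts.getValue("", "class"); // namespace-aware lookup
        if (value == null) {
            value = atts.getValue("class");        // fall back to the qName
        }
        if (value == null) {
            return false;
        }
        for (String token : value.trim().split("\\s+")) {
            if (SKIP_CLASS.equals(token)) {
                return true;
            }
        }
        return false;
    }
}
```

If I have this right, I would wrap my existing BodyContentHandler in
something like this and pass the wrapper to the parser, with the
IdentityHtmlMapper registered on the ParseContext so that the div events
actually reach the handler.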

Thank you for your help. This is for a product that people are already
using, and I want to ensure that we don't break anything that currently works.


On Tue, Apr 9, 2013 at 8:25 AM, Jukka Zitting <[email protected]> wrote:

> Hi,
>
> On Tue, Apr 9, 2013 at 3:19 PM, Jason Tesser <[email protected]> wrote:
> > I would rather not alter Tika code. So that brings me to option 1. I don't
> > really understand how IdentityHtmlMapper helps.
> > http://tika.apache.org/1.2/api/org/apache/tika/parser/html/IdentityHtmlMapper.html
> > The class doesn't seem to expose methods to do anything different than the
> > DefaultHtmlMapper.
> >
> > Can you give me just a little more detail here?
>
> The IdentityHtmlMapper makes Tika pass the parsed HTML as-is to the
> specified SAX ContentHandler, so you'll also get the <div
> class="donotparse"> events that would otherwise be swallowed by the
> DefaultHtmlMapper strategy. You can write a custom ContentHandler
> class that detects the "donotparse" attributes and skips all content
> within such elements.
>
> BR,
>
> Jukka Zitting
>
