I would rather not alter Tika code. So that brings me to option 1.  I don't
really understand how IdentityHtmlMapper helps.
http://tika.apache.org/1.2/api/org/apache/tika/parser/html/IdentityHtmlMapper.html
The class doesn't seem to expose methods to do anything different than the
DefaultHtmlMapper.

Can you give me just a little more detail here?


On Tue, Apr 9, 2013 at 12:49 AM, Jukka Zitting <[email protected]> wrote:

> Hi,
>
> On Mon, Apr 8, 2013 at 9:32 PM, Jason Tesser <[email protected]>
> wrote:
> > What is the right way to do this?
>
> I see two options:
>
> 1) Use the IdentityHtmlMapper strategy to have Tika pass you all HTML
> elements as-is. Then you can explicitly skip selected elements in the
> SAX ContentHandler you pass in to the parser. Something like this:
>
>     ParseContext context = new ParseContext();
>     context.set(HtmlMapper.class, new IdentityHtmlMapper());
>     parser.parse(..., context);
>
> 2) Extend the HtmlMapper interface to include also attributes in the
> isDiscardElement() method. Then you can pass a custom mapper class
> that implements the class="donotparse" strategy you describe. This
> approach requires changes in Tika, so you might want to consider
> submitting a patch of your (ideally backwards-compatible) changes.
>
> BR,
>
> Jukka Zitting
>
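
For reference, option 1 above could be sketched roughly as follows. This is only an illustration under stated assumptions, not code from Tika: the class name SkipClassHandler is hypothetical, it uses only the JDK's SAX classes, and it assumes the class attribute reaches the handler under the qualified name "class" once IdentityHtmlMapper is set. The handler collects text and drops every element subtree marked class="donotparse".

```java
import java.util.Arrays;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/**
 * Hypothetical handler: collects character data, but skips any element
 * (and everything nested inside it) whose class attribute lists "donotparse".
 */
class SkipClassHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    // 0 = not skipping; otherwise the nesting depth inside a skipped element
    private int skipDepth = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (skipDepth > 0) {
            skipDepth++;            // nested element inside a skipped subtree
            return;
        }
        // Assumption: the attribute is exposed under the qualified name "class"
        String cls = atts.getValue("class");
        if (cls != null && Arrays.asList(cls.trim().split("\\s+")).contains("donotparse")) {
            skipDepth = 1;          // start skipping at this element
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (skipDepth > 0) {
            skipDepth--;            // leaving a skipped element
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (skipDepth == 0) {
            text.append(ch, start, length);
        }
    }

    /** Returns the text collected outside of skipped elements. */
    public String getText() {
        return text.toString();
    }
}
```

An instance of this handler would be passed as the ContentHandler argument of parser.parse(...), together with the ParseContext shown in the quoted snippet. The mapper only stops Tika from filtering or renaming elements; the decision about what to skip lives entirely in the handler.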
