I would rather not alter Tika code, so that brings me to option 1. However, I don't really understand how IdentityHtmlMapper helps. http://tika.apache.org/1.2/api/org/apache/tika/parser/html/IdentityHtmlMapper.html The class doesn't seem to expose methods to do anything different than the DefaultHtmlMapper.
Can you give me just a little more detail here?

On Tue, Apr 9, 2013 at 12:49 AM, Jukka Zitting <[email protected]> wrote:
> Hi,
>
> On Mon, Apr 8, 2013 at 9:32 PM, Jason Tesser <[email protected]> wrote:
> > What is the right way to do this?
>
> I see two options:
>
> 1) Use the IdentityHtmlMapper strategy to have Tika pass you all HTML
> elements as-is. Then you can explicitly skip selected elements in the
> SAX ContentHandler you pass in to the parser. Something like this:
>
>     ParseContext context = new ParseContext();
>     context.set(HtmlMapper.class, new IdentityHtmlMapper());
>     parser.parse(..., context);
>
> 2) Extend the HtmlMapper interface to include also attributes in the
> isDiscardElement() method. Then you can pass a custom mapper class
> that implements the class="donotparse" strategy you describe. This
> approach requires changes in Tika, so you might want to consider
> submitting a patch of your (ideally backwards-compatible) changes.
>
> BR,
>
> Jukka Zitting
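
For reference, the "skip selected elements in the SAX ContentHandler" part of option 1 might look roughly like the sketch below. It uses only the JDK's SAX classes so that it runs on its own. The class name SkipMarkedElements and the extract() helper are made up for illustration, and it assumes that, with IdentityHtmlMapper set in the ParseContext, Tika passes the class attribute through to the handler unchanged. The Tika wiring is shown only in a comment and is not exercised here.

```java
import java.io.StringReader;
import java.util.Arrays;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects character data, but ignores everything
// inside an element whose class attribute contains "donotparse".
class SkipMarkedElements extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    // Nesting depth inside a skipped subtree; 0 means "not skipping".
    private int skipDepth = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (skipDepth > 0) {
            skipDepth++;                       // nested element inside a skipped subtree
            return;
        }
        String cls = atts.getValue("class");
        if (cls != null && Arrays.asList(cls.trim().split("\\s+")).contains("donotparse")) {
            skipDepth = 1;                     // start skipping at this element
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (skipDepth > 0) {
            skipDepth--;                       // leaving a skipped element
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (skipDepth == 0) {
            text.append(ch, start, length);    // keep text outside skipped subtrees
        }
    }

    String getText() {
        return text.toString();
    }

    // Stand-alone demo helper: feeds well-formed XHTML through the JDK SAX parser.
    // With Tika, the same handler would instead be passed to the parser, e.g.
    // (not run here):
    //   context.set(HtmlMapper.class, new IdentityHtmlMapper());
    //   parser.parse(stream, handler, metadata, context);
    static String extract(String xhtml) throws Exception {
        SkipMarkedElements handler = new SkipMarkedElements();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getText();
    }

    public static void main(String[] args) throws Exception {
        String in = "<div>keep <span class=\"donotparse\">drop <b>this</b></span>me</div>";
        System.out.println(extract(in));       // prints "keep me"
    }
}
```

The depth counter is there so that nested elements inside a skipped element do not end the skipping too early. If the output needs to be XHTML rather than plain text, the same idea should work by forwarding the non-skipped events to another ContentHandler instead of appending to a StringBuilder.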
