Hello list,

The Tika BodyContentHandler class handles the contents inside the <body>....</body> of an HTML document. Now I need to handle the content inside a <div id="special"> ....</div> of an HTML document.

I got stuck trying to check the <div> attribute "id".

The problem is that the attributes of elements are always empty inside method ContentHandler.startElement, because they get lost in a method that is called earlier:
HtmlHandler.startElement
...
    if (safe != null) {
        xhtml.startElement(safe); // element attributes not passed in

Below you'll find the steps I have taken so far. Is there a better way to do this?


Thanks,

Anne



=======================================
Steps I took

First of all, you need to trigger a "ContentHandler.startElement" event for the <div> element:

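// override method HtmlMapper.mapSafeElement
import org.apache.tika.parser.html.DefaultHtmlMapper;

public class DivHtmlMapper extends DefaultHtmlMapper {
    @Override
    public String mapSafeElement(String name) {
        // compare case-insensitively, since the parser may report the name as "div" rather than "DIV"
        if ("div".equalsIgnoreCase(name)) return "div";
        return super.mapSafeElement(name);
    }
}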

// add the override to the parser context
ParseContext context = new ParseContext();
context.set(HtmlMapper.class, new DivHtmlMapper());

Then override the ContentHandler methods "startElement", "endElement" and "characters" to catch the events for the <div> element:

class DivContentHandler extends BodyContentHandler {
    // override methods startElement, endElement and characters ...
}
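
To make the idea concrete, here is a minimal sketch of what such a handler might look like. It assumes the attributes actually reach startElement, which is exactly the part I cannot confirm. It extends the JDK's DefaultHandler instead of BodyContentHandler so that it can be tried without Tika, and the names DivTextHandler, getText and extract are hypothetical, not part of any library.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Collects the text inside every <div> whose "id" equals wantedId (hypothetical helper). */
class DivTextHandler extends DefaultHandler {
    private final String wantedId;
    private final StringBuilder text = new StringBuilder();
    private int depth = 0; // 0 = outside the target div, >0 = nesting level inside it

    DivTextHandler(String wantedId) {
        this.wantedId = wantedId;
    }

    @Override
    public void startElement(String uri, String local, String name, Attributes atts) {
        if (depth > 0) {
            depth++; // nested element inside the target div
        } else if ("div".equalsIgnoreCase(name) && wantedId.equals(atts.getValue("id"))) {
            depth = 1; // entering the target div
        }
    }

    @Override
    public void endElement(String uri, String local, String name) {
        if (depth > 0) {
            depth--; // reaches 0 again when the target div closes
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) {
            text.append(ch, start, length);
        }
    }

    String getText() {
        return text.toString();
    }

    /** Runs the handler over well-formed XHTML with the JDK SAX parser (no Tika involved). */
    static String extract(String xhtml, String id) {
        try {
            DivTextHandler handler = new DivTextHandler(id);
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.getText();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }
}
```

The depth counter is there so that elements nested inside the target <div> do not end the capture early. With Tika, the same three overrides would go into DivContentHandler, but they only work if the "id" attribute is still present when startElement is called.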

Then pass the context and the handler into the parser:

autoParser.parse(stream, divContentHandler, metadata, context);

This results in the parser executing method:
void HtmlHandler.startElement(String uri, String local, String name, Attributes atts)
{
    ....
    String safe = mapper.mapSafeElement(name); // overridden above
    if (safe != null) {
        // div element is now 'safe'
        xhtml.startElement(safe);

The element attributes, available in variable 'atts', are not passed on here. Are they lost?

