Hello list,
The Tika BodyContentHandler class handles the contents inside the
<body>....</body> of an HTML document.
Now I need to handle the content inside a <div id="special"> ....</div>
of an HTML document.
I got stuck trying to check the "id" attribute of the <div>.
The problem is that element attributes are always empty inside
ContentHandler.startElement, because they are dropped in a
method that is called earlier:
HtmlHandler.startElement
    ...
    if (safe != null) {
        xhtml.startElement(safe); // element attributes not passed in
Below you'll find the steps I have taken so far. Is there a better way
to do this?
Thanks,
Anne
=======================================
Steps I took
First of all, you need to trigger a ContentHandler.startElement event
for the <div> element:
// override method HtmlMapper.mapSafeElement
public class DivHtmlMapper extends DefaultHtmlMapper {
    @Override
    public String mapSafeElement(String name) {
        if ("DIV".equals(name)) return "div";
        return super.mapSafeElement(name);
    }
}
// add the override to the parser context
ParseContext context = new ParseContext();
context.set(HtmlMapper.class, new DivHtmlMapper());
Then override the ContentHandler methods startElement, endElement and
characters to catch the events for the <div> element:
class DivContentHandler extends BodyContentHandler
{
    // override startElement, endElement and characters ...
}
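A minimal sketch of what such a handler could look like, assuming the
"id" attribute does reach the handler. To keep it self-contained it
extends the JDK's DefaultHandler rather than BodyContentHandler; the
class name DivCaptureHandler and its getText method are made up for
illustration and are not part of Tika:

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text found inside <div id="...">.
class DivCaptureHandler extends DefaultHandler {
    private final String wantedId;
    private final StringBuilder text = new StringBuilder();
    // 0 = outside the target div; > 0 = inside it (counts nested divs too)
    private int depth = 0;

    DivCaptureHandler(String wantedId) {
        this.wantedId = wantedId;
    }

    @Override
    public void startElement(String uri, String local, String name, Attributes atts) {
        if (depth > 0) {
            // Already inside the target: track nested divs so the
            // matching end tag can be recognised.
            if ("div".equals(local)) depth++;
        } else if ("div".equals(local) && wantedId.equals(atts.getValue("id"))) {
            depth = 1;
        }
    }

    @Override
    public void endElement(String uri, String local, String name) {
        if (depth > 0 && "div".equals(local)) depth--;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only keep text while inside the target div.
        if (depth > 0) text.append(ch, start, length);
    }

    String getText() {
        return text.toString();
    }
}
```

The depth counter is there so that a nested <div> inside the target
does not end the capture early. Of course this only works if "atts"
actually contains the "id" attribute when startElement is called.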
Then pass the context and the handler into the parser:
autoParser.parse(stream, divContentHandler, metadata, context);
This results in the parser executing method:
void HtmlHandler.startElement(String uri, String local, String name,
                              Attributes atts)
{
    ....
    String safe = mapper.mapSafeElement(name); // overridden above
    if (safe != null) {
        // div element is now 'safe'
        xhtml.startElement(safe);
The element attributes, available in the variable 'atts', are not passed
on here. Are they lost?
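
To make the question concrete, here is a small stand-in built only on
the JDK SAX classes. It is not Tika's code: ForwardingFilter and
IdRecorder are hypothetical names, and the boolean flag merely
simulates the two possible behaviours (forwarding 'atts' versus
starting the element with an empty attribute list):

```java
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical stand-in for the forwarding step; not Tika's HtmlHandler.
class ForwardingFilter extends XMLFilterImpl {
    private final boolean forwardAttributes;

    ForwardingFilter(boolean forwardAttributes) {
        this.forwardAttributes = forwardAttributes;
    }

    @Override
    public void startElement(String uri, String local, String name, Attributes atts)
            throws SAXException {
        // Either hand the original attributes on, or replace them
        // with an empty list before calling the downstream handler.
        Attributes out = forwardAttributes ? atts : new AttributesImpl();
        super.startElement(uri, local, name, out);
    }
}

// Downstream handler that records the "id" it sees (null if absent).
class IdRecorder extends DefaultHandler {
    String lastId;

    @Override
    public void startElement(String uri, String local, String name, Attributes atts) {
        lastId = atts.getValue("id");
    }
}
```

With the flag set to false the downstream handler sees no "id" at all,
which matches what I observe in my handler. With it set to true the
"id" arrives intact.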