All right, I removed the HTML_SCHEMA stuff and, inside HtmlHandler, added an
exception for script in startElement:
} else if ("SCRIPT".equals(name)) {
startElementWithSafeAttributes("script", atts);
}
This goes well: the element is reported, but its characters are not. To get
those reported too, I removed the bodyLevel check from characters():
if (bodyLevel > 0 && discardLevel == 0) {
super.characters(ch, start, length);
}
and so on. This obviously breaks some unit tests:
testElementOrdering(org.apache.tika.parser.html.HtmlParserTest)
testBrokenFrameset(org.apache.tika.parser.html.HtmlParserTest)
testBoilerplateDelegation(org.apache.tika.parser.html.HtmlParserTest)
testLinkHrefResolution(org.apache.tika.parser.html.HtmlParserTest)
testNewlineAndIndent(org.apache.tika.parser.html.HtmlParserTest)
Now, this is clearly not the right way to do this. I assume the best thing is
to treat script similarly to bodyLevel and titleLevel? Add some scriptLevel
counter and pass the characters on if we're inside a script?
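
Something along these lines is what I have in mind. This is only a rough,
self-contained sketch against plain SAX, not Tika's actual HtmlHandler:
ScriptAwareHandler, the scriptLevel field and the demo() helper are all
hypothetical names, and I left out discardLevel and the mapper logic to keep
it short. The point is that characters are passed on when we are inside body
or inside a script, so the bodyLevel check can stay in place.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class ScriptLevelSketch {

    /** Hypothetical, simplified stand-in for HtmlHandler; not Tika's actual code. */
    static class ScriptAwareHandler extends DefaultHandler {
        private final ContentHandler delegate;
        private int bodyLevel = 0;
        private int scriptLevel = 0; // assumed new counter, analogous to bodyLevel

        ScriptAwareHandler(ContentHandler delegate) {
            this.delegate = delegate;
        }

        @Override
        public void startElement(String uri, String local, String name, Attributes atts)
                throws SAXException {
            if ("BODY".equals(name) || bodyLevel > 0) {
                bodyLevel++;
            }
            if ("SCRIPT".equals(name) || scriptLevel > 0) {
                scriptLevel++;
            }
            // Only the script element itself is forwarded in this sketch.
            if ("SCRIPT".equals(name)) {
                delegate.startElement(uri, "script", "script", atts);
            }
        }

        @Override
        public void endElement(String uri, String local, String name) throws SAXException {
            if ("SCRIPT".equals(name)) {
                delegate.endElement(uri, "script", "script");
            }
            if (scriptLevel > 0) {
                scriptLevel--;
            }
            if (bodyLevel > 0) {
                bodyLevel--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            // Keep the body check, but also let script content through.
            if (bodyLevel > 0 || scriptLevel > 0) {
                delegate.characters(ch, start, length);
            }
        }
    }

    /** Feeds a tiny hand-written event stream through the handler and records the output. */
    static String demo() {
        final StringBuilder out = new StringBuilder();
        ContentHandler collector = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String name, Attributes atts) {
                out.append('<').append(name).append('>');
            }

            @Override
            public void endElement(String uri, String local, String name) {
                out.append("</").append(name).append('>');
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        };
        ScriptAwareHandler handler = new ScriptAwareHandler(collector);
        Attributes none = new AttributesImpl();
        try {
            handler.startElement("", "HEAD", "HEAD", none);
            handler.startElement("", "SCRIPT", "SCRIPT", none);
            handler.characters("var x = 1;".toCharArray(), 0, 10);
            handler.endElement("", "SCRIPT", "SCRIPT");
            handler.startElement("", "TITLE", "TITLE", none);
            handler.characters("T".toCharArray(), 0, 1); // outside body and script: dropped
            handler.endElement("", "TITLE", "TITLE");
            handler.endElement("", "HEAD", "HEAD");
            handler.startElement("", "BODY", "BODY", none);
            handler.startElement("", "P", "P", none);
            handler.characters("hello".toCharArray(), 0, 5);
            handler.endElement("", "P", "P");
            handler.endElement("", "BODY", "BODY");
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With a counter like this the script text in head gets through while the title
text is still held back, so in principle the existing tests that rely on the
bodyLevel check should not need to change. I have not tried that against the
real HtmlHandler yet.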
-----Original message-----
> From:Markus Jelsma <[email protected]>
> Sent: Wednesday 9th October 2013 11:25
> To: [email protected]
> Subject: Script element not reported in custom handler
>
> Hi,
>
> I'm building a new ContentHandler that needs to do some work on script
> elements as well. But they are not reported in my startElement method. The
> context has the IdentityHtmlMapper set and script does not get discarded in
> Tika's own HtmlHandler. Instead, the script element is reported in
> HtmlHandler but not in my custom handler.
>
> The confusing thing is that I am able to get it in my handler when adding the
> script element to TagSoup inside HtmlParser's constructor:
> HTML_SCHEMA.elementType("script", HTMLSchema.M_EMPTY, 65535, 0);
>
> Without this, script and its characters are only reported inside
> HtmlHandler, never in custom handlers.
>
> I must be doing something wrong here, any hints?
>
> Thanks,
> Markus
>