All right, I removed the HTML_SCHEMA stuff and, inside HtmlHandler, added an 
exception for script in startElement:

            } else if ("SCRIPT".equals(name)) {
                startElementWithSafeAttributes("script", atts);
            }

This works: the element is reported, but its characters are not. To get those 
as well, I removed the bodyLevel > 0 condition from this check in characters():

        if (bodyLevel > 0 && discardLevel == 0) {
            super.characters(ch, start, length);
        }

and so on. As expected, this breaks some unit tests:

  testElementOrdering(org.apache.tika.parser.html.HtmlParserTest)
  testBrokenFrameset(org.apache.tika.parser.html.HtmlParserTest)
  testBoilerplateDelegation(org.apache.tika.parser.html.HtmlParserTest)
  testLinkHrefResolution(org.apache.tika.parser.html.HtmlParserTest)
  testNewlineAndIndent(org.apache.tika.parser.html.HtmlParserTest)

Now, this is clearly not the right way to do it. I assume the best thing is to 
treat script the same way as bodyLevel and titleLevel? Add a scriptLevel 
counter and let characters through while we're inside a script? 
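
To make the scriptLevel idea concrete, here is a rough, standalone sketch. It 
is not Tika's HtmlHandler: the class name ScriptLevelHandler and the text() 
accessor are made up for illustration, it collects characters in a buffer 
instead of calling super.characters(), and it leaves out discardLevel and 
titleLevel entirely. It only shows the counter being bumped in startElement 
and endElement and then consulted in characters().

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch of a scriptLevel counter; not Tika's actual HtmlHandler. */
class ScriptLevelHandler extends DefaultHandler {
    private int bodyLevel = 0;    // > 0 while inside <body>
    private int scriptLevel = 0;  // > 0 while inside <script> (assumed new counter)
    private final StringBuilder out = new StringBuilder();

    @Override
    public void startElement(String uri, String local, String name, Attributes atts) {
        if ("body".equalsIgnoreCase(name)) {
            bodyLevel++;
        } else if ("script".equalsIgnoreCase(name)) {
            scriptLevel++;
        }
    }

    @Override
    public void endElement(String uri, String local, String name) {
        if ("body".equalsIgnoreCase(name) && bodyLevel > 0) {
            bodyLevel--;
        } else if ("script".equalsIgnoreCase(name) && scriptLevel > 0) {
            scriptLevel--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Pass text through inside <body> or inside any <script>, even in <head>.
        if (bodyLevel > 0 || scriptLevel > 0) {
            out.append(ch, start, length);
        }
    }

    /** Returns everything that passed the level checks so far. */
    public String text() {
        return out.toString();
    }
}
```

With something like this, text inside a script in the head would get through, 
while other head text would still be dropped, so the existing bodyLevel 
behaviour should stay as it is for everything that is not a script.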
 
-----Original message-----
> From:Markus Jelsma <[email protected]>
> Sent: Wednesday 9th October 2013 11:25
> To: [email protected]
> Subject: Script element not reported in custom handler
> 
> Hi,
> 
> I'm building a new ContentHandler that needs to do some work on script 
> elements as well. But they are not reported in my startElement method. The 
> context has the IdentityHtmlMapper set and script does not get discarded in 
> Tika's own HtmlHandler. Instead, the script element is reported in 
> HtmlHandler but not in my custom handler.
> 
> The confusing thing is that I am able to get it in my handler when adding the 
> script element to TagSoup inside HtmlParser's constructor:
>         HTML_SCHEMA.elementType("script", HTMLSchema.M_EMPTY, 65535, 0);
> 
> Without this, script and its characters are only reported inside 
> HtmlHandler, never in custom handlers.
> 
> I must be doing something wrong here, any hints?
> 
> Thanks,
> Markus
> 
