Hi Markus,
I assume you're trying to get at script text that's inside of the <head>
element, yes?
And that's why the bodyLevel > 0 check is stripping it out?
I think there's a pretty strong assumption that only text inside of the <body>
should be returned, so changing that bodyLevel check seems too heavy-handed.
Normally when I need to get at unusual content, I wind up bypassing Tika
and running the HTML through TagSoup, then using Dom4J or an equivalent to
manipulate the result.
If you think your use case is actually common enough for a more general
solution, then one idea is to support a new flag for "include content from
header" in the parse context. I'm not sure how many other formats would have a
similar issue, though.
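
To make the flag idea concrete, here is a rough sketch using only the plain SAX
classes from the JDK. It is not Tika's HtmlHandler, and the class name and the
includeHeadContent flag are made up for illustration; in Tika the flag would
presumably be looked up from the parse context instead of being passed to a
constructor. The handler keeps a headLevel counter next to bodyLevel and only
lets <head> text through when the flag is set.

```java
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler for illustration only; not part of Tika.
class HeadAwareTextHandler extends DefaultHandler {

    // Hypothetical flag: when true, text inside <head> is kept as well.
    private final boolean includeHeadContent;
    private final StringBuilder text = new StringBuilder();
    private int bodyLevel = 0;
    private int headLevel = 0;

    HeadAwareTextHandler(boolean includeHeadContent) {
        this.includeHeadContent = includeHeadContent;
    }

    @Override
    public void startElement(String uri, String local, String name, Attributes atts) {
        if ("body".equalsIgnoreCase(name)) {
            bodyLevel++;
        } else if ("head".equalsIgnoreCase(name)) {
            headLevel++;
        }
    }

    @Override
    public void endElement(String uri, String local, String name) {
        if ("body".equalsIgnoreCase(name)) {
            bodyLevel--;
        } else if ("head".equalsIgnoreCase(name)) {
            headLevel--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Body text always passes; head text passes only when the flag is set.
        if (bodyLevel > 0 || (includeHeadContent && headLevel > 0)) {
            text.append(ch, start, length);
        }
    }

    String getText() {
        return text.toString();
    }

    // Feeds a tiny hand-built event stream (a script in <head>, a word in
    // <body>) through the handler and returns the collected text.
    static String demo(boolean includeHeadContent) {
        HeadAwareTextHandler h = new HeadAwareTextHandler(includeHeadContent);
        Attributes none = new AttributesImpl();
        char[] js = "var x = 1;".toCharArray();
        char[] body = "Hello".toCharArray();
        try {
            h.startDocument();
            h.startElement("", "html", "html", none);
            h.startElement("", "head", "head", none);
            h.startElement("", "script", "script", none);
            h.characters(js, 0, js.length);
            h.endElement("", "script", "script");
            h.endElement("", "head", "head");
            h.startElement("", "body", "body", none);
            h.characters(body, 0, body.length);
            h.endElement("", "body", "body");
            h.endElement("", "html", "html");
            h.endDocument();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return h.getText();
    }

    public static void main(String[] args) {
        System.out.println(demo(false)); // prints "Hello"
        System.out.println(demo(true));  // prints "var x = 1;Hello"
    }
}
```

With the flag off, the behaviour matches the existing "body only" assumption,
so the default output would not change; only callers that opt in would see the
script text.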
-- Ken
On Oct 9, 2013, at 2:55am, Markus Jelsma wrote:
> All right, I removed the HTML_SCHEMA stuff and, inside the HtmlHandler, added
> an exception for script in startElement:
>
>   } else if ("SCRIPT".equals(name)) {
>       startElementWithSafeAttributes("script", atts);
>   }
>
> This goes well: the element is reported, but its characters are not. To get
> those reported too, I removed the bodyLevel check from characters():
>
>   if (bodyLevel > 0 && discardLevel == 0) {
>       super.characters(ch, start, length);
>   }
>
> etc etc. This obviously breaks some unit tests:
>
> testElementOrdering(org.apache.tika.parser.html.HtmlParserTest)
> testBrokenFrameset(org.apache.tika.parser.html.HtmlParserTest)
> testBoilerplateDelegation(org.apache.tika.parser.html.HtmlParserTest)
> testLinkHrefResolution(org.apache.tika.parser.html.HtmlParserTest)
> testNewlineAndIndent(org.apache.tika.parser.html.HtmlParserTest)
>
> Now, this is clearly not the right way to do this. I assume the best thing is
> to treat script similarly to bodyLevel and titleLevel? Add a scriptLevel and
> move on if we're inside a script?
>
> -----Original message-----
>> From:Markus Jelsma <[email protected]>
>> Sent: Wednesday 9th October 2013 11:25
>> To: [email protected]
>> Subject: Script element not reported in custom handler
>>
>> Hi,
>>
>> I'm building a new ContentHandler that needs to do some work on script
>> elements as well. But they are not reported in my startElement method. The
>> context has the IdentityHtmlMapper set and script does not get discarded in
>> Tika's own HtmlHandler. Instead, the script element is reported in
>> HtmlHandler but not in my custom handler.
>>
>> The confusing thing is that I am able to get it in my handler when adding
>> the script element to TagSoup inside HtmlParser's constructor:
>> HTML_SCHEMA.elementType("script", HTMLSchema.M_EMPTY, 65535, 0);
>>
>> Without this, script and its characters are only reported inside
>> HtmlHandler, never in custom handlers.
>>
>> I must be doing something wrong here. Any hints?
>>
>> Thanks,
>> Markus
>>
--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr