Hi Markus,

I assume you're trying to get at script text that's inside of the <head> 
element, yes?

And that's why the bodyLevel > 0 check is stripping it out?

I think there's a pretty strong assumption that only text inside of the <body> 
should be returned, so changing that bodyLevel check seems too heavy-handed.

Normally when I need to get at unusual content, I wind up bypassing Tika 
and running the HTML through TagSoup, then using Dom4J or equivalent to 
manipulate the result.
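
A rough sketch of that second step, under some loud assumptions: it only uses 
the JDK, so the JDK's own XML parser stands in for TagSoup (meaning the input 
must already be well-formed, which real-world HTML usually is not), and 
org.w3c.dom stands in for Dom4J. The class and method names are made up for 
illustration and are not part of Tika or TagSoup.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Tika or TagSoup.
final class ScriptTextExtractor {

    // Returns the text of every <script> element in document order,
    // regardless of whether it sits inside <head> or <body>.
    // Assumption: the input is already well-formed XML (for example, the
    // cleaned-up output of a tag-balancing parser such as TagSoup).
    static List<String> extractScripts(String wellFormedXhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(wellFormedXhtml)));

        List<String> scripts = new ArrayList<>();
        NodeList nodes = doc.getElementsByTagName("script");
        for (int i = 0; i < nodes.getLength(); i++) {
            // getTextContent() concatenates the text nodes below the element.
            scripts.add(nodes.item(i).getTextContent());
        }
        return scripts;
    }
}
```

Calling ScriptTextExtractor.extractScripts(...) on a small document with one 
script in the head and one in the body should return both script bodies, since 
nothing here cares about a body level.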

If you think your use case is actually common enough for a more general 
solution, then one idea is to support a new flag for "include content from 
header" in the parse context. I'm not sure how many other formats would have a 
similar issue, though.

-- Ken

On Oct 9, 2013, at 2:55am, Markus Jelsma wrote:

> All right, I removed the HTML_SCHEMA stuff and, inside the HtmlHandler, added 
> an exception for script in startElement:
> 
>            } else if ("SCRIPT".equals(name)) {
>                startElementWithSafeAttributes("script", atts);
>            }
> 
> This goes well: the element is reported, but the characters are not. To get 
> them reported I removed the bodyLevel check from characters():
> 
>        if (bodyLevel > 0 && discardLevel == 0) {
>            super.characters(ch, start, length);
>        }
> 
> etc etc. This obviously breaks some unit tests:
> 
>  testElementOrdering(org.apache.tika.parser.html.HtmlParserTest)
>  testBrokenFrameset(org.apache.tika.parser.html.HtmlParserTest)
>  testBoilerplateDelegation(org.apache.tika.parser.html.HtmlParserTest)
>  testLinkHrefResolution(org.apache.tika.parser.html.HtmlParserTest)
>  testNewlineAndIndent(org.apache.tika.parser.html.HtmlParserTest)
> 
> Now, this is clearly not the right approach. I assume the best thing is 
> to treat script similarly to bodyLevel and titleLevel? Add some scriptLevel 
> and move on if we're in a script? 
> 
> -----Original message-----
>> From:Markus Jelsma <[email protected]>
>> Sent: Wednesday 9th October 2013 11:25
>> To: [email protected]
>> Subject: Script element not reported in custom handler
>> 
>> Hi,
>> 
>> I'm building a new ContentHandler that needs to do some work on script 
>> elements as well. But they are not reported in my startElement method. The 
>> context has the IdentityHtmlMapper set and script does not get discarded in 
>> Tika's own HtmlHandler. Instead, the script element is reported in 
>> HtmlHandler but not in my custom handler.
>> 
>> The confusing thing is that I am able to get it in my handler when adding 
>> the script element to TagSoup inside HtmlParser's constructor:
>>        HTML_SCHEMA.elementType("script", HTMLSchema.M_EMPTY, 65535, 0);
>> 
>> Without this, script and its characters are only reported inside 
>> HtmlHandler, never in custom handlers.
>> 
>> I must be doing something wrong here, any hints?
>> 
>> Thanks,
>> Markus
>> 

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr




