Re: Script tag contents not always reported in ContentHandler

Markus Jelsma Thu, 30 May 2024 06:39:28 -0700

Hello Tim,

Nothing to apologize for. It is the embedded type="ld+json" script containing
Microdata in which we are not interested. Tika has no problem reporting it
in some cases, but in many cases we don't get the characters. We do get the
startElement reported. A product page on ritel [1] is one example. I'll
send you the actual HTML file just in case the online version has changed
since we last downloaded it.


If we enable schema.elementType("script", HTMLSchema.M_ANY, 255, 0); we do
get the JSON blob reported. However, the order in which startElement receives
the elements in the head then changes as well.
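
To make the two symptoms (missing characters and changed element order)
easier to compare between runs, a small probe handler along these lines could
be used. This is only a sketch under assumptions: ScriptProbeHandler and its
probe() helper are hypothetical names, not part of Tika or of our code, and
probe() drives the handler with the JDK's own SAX parser on well-formed XHTML
purely for demonstration. An empty string in scriptBodies() would mean that
startElement and endElement arrived for a script without any characters() call.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical diagnostic handler: records the order of startElement calls
// and the text reported via characters() inside each script element.
class ScriptProbeHandler extends DefaultHandler {
    private final List<String> elementOrder = new ArrayList<>();
    private final List<String> scriptBodies = new ArrayList<>();
    private StringBuilder current; // non-null only while inside a script element

    // Parsers that are not namespace-aware leave localName empty, so fall back to qName.
    private static String nameOf(String localName, String qName) {
        return (localName == null || localName.isEmpty()) ? qName : localName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String name = nameOf(localName, qName);
        elementOrder.add(name);
        if ("script".equalsIgnoreCase(name)) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // characters() may be called several times per element, so append each chunk.
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current != null && "script".equalsIgnoreCase(nameOf(localName, qName))) {
            scriptBodies.add(current.toString());
            current = null;
        }
    }

    List<String> elementOrder() {
        return elementOrder;
    }

    List<String> scriptBodies() {
        return scriptBodies;
    }

    // Demonstration only: feeds well-formed XHTML through the JDK SAX parser.
    static ScriptProbeHandler probe(String xhtml) throws Exception {
        ScriptProbeHandler handler = new ScriptProbeHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler;
    }
}
```

The same handler instance could be passed to parser.parse(stream, handler,
metadata, context) once with and once without the schema.elementType change,
and the two elementOrder() lists compared afterwards.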

Many thanks already!
Markus

[1] https://www.ritel.nl/samsung/galaxy-a15/

On Thu, 30 May 2024 at 14:53, Tim Allison <talli...@apache.org> wrote:

> Markus,
>   I'm sorry for my delay. We're migrating to jsoup in 3.x. I realize that
> 3.x isn't out yet, but I wanted to give you a heads up.
>
>   To extract scripts in 3.x, you'd do something like this:
> https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-html-module/src/test/resources/org/apache/tika/parser/html/tika-config.xml
>
>   You should be able to swap in the HtmlParser for the JsoupParser in that
> config and be good to go.
>
>   Are you able to share an example html with me, even if only privately? I
> _think_ we have a unit test for script handling in 2.x and 3.x, and it
> _should_ work.
>
>       Best,
>
>                 Tim
>
> On Wed, May 29, 2024 at 9:37 AM Markus Jelsma <markus.jel...@openindex.io>
> wrote:
>
>> So I found HtmlParser.setExtractScripts(), which sounds very promising!
>> I changed the code to use HtmlParser instead of AutoDetectParser and set the
>> flag to true. Unfortunately, the script's contents were still not reported
>> in the characters method. No idea why.
>>
>> I also found TagSoup's Parser.CDATAElementsFeature constant:
>> https://javadoc.io/static/org.ccil.cowan.tagsoup/tagsoup/1.2.1/org/ccil/cowan/tagsoup/Parser.html#CDATAElementsFeature
>> Seems to be the same as:
>> http://www.ccil.org/~cowan/tagsoup/features/cdata-elements
>> A value of "true" indicates that the parser will process the script and
>> style elements (or any elements with type='cdata' in the TSSL schema) as
>> SGML CDATA elements (that is, no markup is recognized except the matching
>> end-tag).
>>
>> Sounds promising, or at least like something to try. But how exactly do we
>> set that parameter, either from code or in tika-config.xml if that is
>> better? It isn't really obvious at the moment.
>>
>> Many thanks,
>> Markus
>>
>>
>>
>> On Tue, 28 May 2024 at 12:19, Markus Jelsma <markus.jel...@openindex.io>
>> wrote:
>>
>>> Hello,
>>>
>>> We're using Tika to parse HTML via a custom ContentHandler. This works
>>> really well, except that in some cases we do not get the contents of script
>>> tags in the head reported in the characters() method of the ContentHandler.
>>>
>>> We're using this code:
>>> TikaConfig tikaConfig = new TikaConfig(SAXTestCase.class.getResourceAsStream("/tika-config.xml"));
>>> Schema schema = new HTMLSchema();
>>> ParseContext context = new ParseContext();
>>> context.set(Schema.class, schema);
>>> context.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE);
>>> Metadata metadata = new Metadata();
>>> ReadableContentHandler handler = new ReadableContentHandler(url, config);
>>> AutoDetectParser parser = new AutoDetectParser(tikaConfig);
>>> InputStream stream = SAXTestCase.class.getResourceAsStream(path);
>>> parser.parse(stream, handler, metadata, context);
>>>
>>> If we fiddle with TagSoup's Schema, some of the failing examples suddenly
>>> do report the characters of the script tag. But, in good tradition, other
>>> things break: meta fields in some other HTML examples, for instance, are
>>> no longer reported.
>>>
>>> schema.elementType("script", HTMLSchema.M_ANY, 255, 0);
>>>
>>> Now, I don't even know if changing the schema is a good idea, or if
>>> there is some other setting in Tika that I do not know or forgot about.
>>>
>>> Does anyone here have any ideas?
>>>
>>> Thanks,
>>> Markus
>>>
>>
