HtmlParser removal in 3.x

David Pilato Tue, 25 Mar 2025 14:00:10 -0700

Hey team,

The page https://tika.apache.org/3.1.0/formats.html#HyperText_Markup_Language 
mentions:


> The output from the HtmlParser class is guaranteed to be well-formed and 
> valid XHTML, and various heuristics are used to prevent things like inline 
> scripts from cluttering the extracted text content.

But the HtmlParser link there points to a class page that no longer exists:
https://tika.apache.org/3.1.0/api/org/apache/tika/parser/html/HtmlParser.html
Should it be 
https://tika.apache.org/3.1.0/api/org/apache/tika/parser/html/JSoupParser.html 
instead?



David Pilato
da...@pilato.fr
06 13 03 08 41
