HtmlCleaner says that one of its methods auto-detects the charset:
http://htmlcleaner.sourceforge.net/doc/org/htmlcleaner/HtmlCleaner.html#clean(java.net.URL)


....

Okay, this could probably be copied almost unmodified (if
HtmlCleaner's 3-clause BSD license allows it,
http://htmlcleaner.sourceforge.net/license.php).

A possible extension would be a loop for the case where there are multiple
charset declarations, breaking out on the first `supported` charset. (I have
seen such malformed HTML in the wild. I doubt OpenOffice would ever produce
it, but who knows what the HTML importer might get reused for later?)

Or just copy it as is, returning the first match with no further guessing...
....

org/htmlcleaner/Utils.java

    public static String getCharsetFromContent(URL url) throws IOException {
        InputStream stream = url.openStream();
        byte chunk[] = new byte[2048];
        int bytesRead = stream.read(chunk);
        if (bytesRead > 0) {
            String startContent = new String(chunk);
            String pattern =
                "\\<meta\\s*http-equiv=[\\\"\\']content-type[\\\"\\']\\s*content\\s*=\\s*[\"']text/html\\s*;\\s*charset=([a-z\\d\\-]*)[\\\"\\'\\>]";
            Matcher matcher = Pattern.compile(pattern,
                Pattern.CASE_INSENSITIVE).matcher(startContent);
            if (matcher.find()) {
                String charset = matcher.group(1);
                if (Charset.isSupported(charset)) {
                    return charset;
                }
            }
        }

        return null;
    }
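
The loop extension mentioned above could look roughly like this. It is only
a sketch, not HtmlCleaner code: the class name CharsetSniffer, the method
firstSupportedCharset and the simplified pattern are my own assumptions, and
it works on an already-decoded String rather than on a URL.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetSniffer {
    // Simplified, hypothetical pattern (not the HtmlCleaner one): captures
    // the token that follows "charset=" inside a <meta ...> tag.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([a-z\\d\\-_]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the first declared charset this JVM supports, or null. */
    public static String firstSupportedCharset(String startContent) {
        Matcher matcher = META_CHARSET.matcher(startContent);
        // Walk over every declaration instead of stopping at the first one.
        while (matcher.find()) {
            String charset = matcher.group(1);
            try {
                if (Charset.isSupported(charset)) {
                    return charset; // break out on the first supported charset
                }
            } catch (IllegalArgumentException e) {
                // Syntactically illegal charset name: skip it and keep looking.
            }
        }
        return null;
    }
}
```

The only real change from the single-match version is `while` instead of
`if`, plus the catch for illegal names, which Charset.isSupported rejects
with an exception rather than returning false.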

-----------------

Another approach might be to use HTML Parser.
http://htmlparser.sourceforge.net/faq.html#encodingchangeexception

This sounds like what we want: a parser able to make an assumption about
the charset and re-scan if it is proven wrong.
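
To illustrate the "assume, then re-scan" idea without depending on any
library, here is a rough sketch in plain Java. It does not use HTML Parser's
API. The class RescanDecoder, the method decode, the ISO-8859-1 starting
assumption and the simplified pattern are all my own assumptions.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RescanDecoder {
    // Simplified, hypothetical pattern for a charset declared in a <meta> tag.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([a-z\\d\\-_]+)",
            Pattern.CASE_INSENSITIVE);

    /** Decodes raw bytes, re-decoding once if the page declares another charset. */
    public static String decode(byte[] raw) {
        // First pass: assume a single-byte charset, so ASCII markup is readable.
        Charset assumed = StandardCharsets.ISO_8859_1;
        String text = new String(raw, assumed);

        Matcher matcher = META_CHARSET.matcher(text);
        if (matcher.find()) {
            String name = matcher.group(1);
            try {
                if (Charset.isSupported(name)) {
                    Charset declared = Charset.forName(name);
                    if (!declared.equals(assumed)) {
                        // Assumption proven wrong: scan again from the raw bytes.
                        text = new String(raw, declared);
                    }
                }
            } catch (IllegalArgumentException e) {
                // Illegal charset name: keep the first-pass result.
            }
        }
        return text;
    }
}
```

This only works when the declaration itself is readable under the first
assumption, which holds for ASCII-compatible encodings but not for, say,
UTF-16.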

--
View this message in context: 
http://xwiki.475771.n2.nabble.com/which-HTML-parsing-libs-are-already-using-shiipped-with-XWiki-tp7580136p7580141.html
Sent from the XWiki- Users mailing list archive at Nabble.com.