HtmlCleaner claims to do charset auto-detection in one of its methods: http://htmlcleaner.sourceforge.net/doc/org/htmlcleaner/HtmlCleaner.html#clean(java.net.URL)
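For reference, that overload is a one-liner to call. A minimal sketch (the class name and URL are placeholders of mine):

    import java.io.IOException;
    import java.net.URL;
    import org.htmlcleaner.HtmlCleaner;
    import org.htmlcleaner.TagNode;

    public class CleanUrlDemo {
        public static void main(String[] args) throws IOException {
            // clean(java.net.URL) is the overload that is supposed to
            // auto-detect the charset before parsing.
            TagNode root = new HtmlCleaner().clean(new URL("http://example.com/page.html"));
            System.out.println(root.getName());  // prints "html"
        }
    }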
Okay, the detection code itself could probably be copy-pasted almost unmodified (if HtmlCleaner's 3-clause BSD license allows it: http://htmlcleaner.sourceforge.net/license.php). A potential extension would be to loop when there are multiple charset declarations (I have seen such malformed HTML in the wild; I doubt OpenOffice would ever produce such a thing, but who knows what the HTML importer might get reused for later?), breaking out on the first supported charset -- a sketch of that variant is further below. Or just copy-paste it as is, to return the first match and do no more guessing.

From org/htmlcleaner/Utils.java:

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;
    import java.nio.charset.Charset;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Reads the first 2 KB of the document and looks for a
    // <meta http-equiv="content-type" ...> charset declaration.
    public static String getCharsetFromContent(URL url) throws IOException {
        InputStream stream = url.openStream();
        byte chunk[] = new byte[2048];
        int bytesRead = stream.read(chunk);
        if (bytesRead > 0) {
            String startContent = new String(chunk);
            String pattern = "\\<meta\\s*http-equiv=[\\\"\\']content-type[\\\"\\']\\s*content\\s*=\\s*[\"']text/html\\s*;\\s*charset=([a-z\\d\\-]*)[\\\"\\'\\>]";
            Matcher matcher = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE).matcher(startContent);
            if (matcher.find()) {
                String charset = matcher.group(1);
                if (Charset.isSupported(charset)) {
                    return charset;
                }
            }
        }
        return null;
    }

-----------------

Another approach might be to use HTML Parser: http://htmlparser.sourceforge.net/faq.html#encodingchangeexception

This sounds like the target: a parser able to make some assumptions about the charset and re-scan if they are proven wrong.
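Back to the multiple-declarations idea: here is a minimal sketch of what that loop could look like. The class and method names are my own invention, not HtmlCleaner API; I have also made it decode only the bytes actually read, close the stream, and skip illegal charset names, which the original does not bother with.

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;
    import java.nio.charset.Charset;
    import java.nio.charset.IllegalCharsetNameException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class CharsetSniffer {

        private static final Pattern META_CHARSET = Pattern.compile(
            "<meta\\s+http-equiv=[\"']content-type[\"']\\s+content\\s*=\\s*[\"']text/html\\s*;\\s*charset=([a-z\\d\\-]+)[\"'>]",
            Pattern.CASE_INSENSITIVE);

        // Like getCharsetFromContent() above, but keeps scanning past
        // unsupported or malformed declarations and returns the first
        // supported one.
        public static String getFirstSupportedCharset(URL url) throws IOException {
            InputStream stream = url.openStream();
            try {
                byte[] chunk = new byte[2048];
                int bytesRead = stream.read(chunk);
                if (bytesRead > 0) {
                    // Meta tags are plain ASCII, so any ASCII-compatible
                    // decoding is good enough for sniffing.
                    String startContent = new String(chunk, 0, bytesRead, "ISO-8859-1");
                    Matcher matcher = META_CHARSET.matcher(startContent);
                    while (matcher.find()) {               // loop over ALL declarations
                        String charset = matcher.group(1);
                        try {
                            if (Charset.isSupported(charset)) {
                                return charset;            // break out on the 1st supported one
                            }
                        } catch (IllegalCharsetNameException e) {
                            // malformed declaration; keep scanning
                        }
                    }
                }
                return null;
            } finally {
                stream.close();
            }
        }
    }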

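As for the HTML Parser approach: I have not verified its exact API, so instead of guessing at it, here is a JDK-only sketch of the same assume-then-re-scan strategy. Decode the bytes with a default charset first; if a meta declaration then names a different supported charset, decode the same bytes again with that one. The class and method names are mine, and the charset regex is deliberately simplistic.

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;
    import java.nio.charset.Charset;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class RescanningReader {

        // Simplistic: matches "charset=..." anywhere in the decoded text.
        private static final Pattern DECLARED_CHARSET = Pattern.compile(
            "charset=([a-z][a-z\\d\\-]*)", Pattern.CASE_INSENSITIVE);

        // First pass assumes ISO-8859-1 (ASCII-compatible, never fails);
        // if the document declares something else, re-decode the same bytes.
        public static String read(URL url) throws IOException {
            byte[] bytes = readAllBytes(url);
            String assumed = "ISO-8859-1";
            String text = new String(bytes, assumed);
            Matcher m = DECLARED_CHARSET.matcher(text);
            if (m.find()) {
                String declared = m.group(1);
                if (!declared.equalsIgnoreCase(assumed) && Charset.isSupported(declared)) {
                    // Assumption proven wrong: re-scan with the declared charset.
                    text = new String(bytes, declared);
                }
            }
            return text;
        }

        private static byte[] readAllBytes(URL url) throws IOException {
            InputStream in = url.openStream();
            try {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[4096];
                for (int n; (n = in.read(buf)) > 0; ) {
                    out.write(buf, 0, n);
                }
                return out.toByteArray();
            } finally {
                in.close();
            }
        }
    }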