Okay. Maybe you'd better write the code? I can only copy-paste from googled sources, with no real Java knowledge and no real way to test.
So even if I do something, it would still have to be reviewed and might not even compile.

http://www.benmccann.com/dev-blog/java-html-parsing-library-comparison/
Here I can see how to create a DOM, yet that would be overkill; SAX is the better approach here. But can SAX be run over HTML rather than XML? java-sources.net suggests using hotsax.sf.net, but it probably lacks encoding auto-detection. Another HTML SAX parser is JTagSoup; it also lacks auto-detection, yet it suggests looking at jchardet.sourceforge.net.

From what I can see, OpenOffice does not offer UTF-16 or similar exports, so we have to choose between UTF-8, UTF-7 and single-byte encodings...

That should replace the hardcoded
    htmlReader = new InputStreamReader(htmlStream, "UTF-8");
at https://github.com/xwiki/xwiki-platform/blob/master/xwiki-platform-core/xwiki-platform-office/xwiki-platform-office-importer/src/main/java/org/xwiki/officeimporter/internal/builder/DefaultXHTMLOfficeDocumentBuilder.java

We can maybe assume any charset initially, since we only need the Latin1 tags and values. Yet... some tag attribute values might be non-Latin, and if the tag order were different, they might come up before the encoding tag. Like in

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
    <HTML>
    <HEAD>
    <META HTTP-EQUIV="CONTENT-TYPE" CONTENT="text/html; charset=utf-8">
    <TITLE></TITLE>
    <META NAME="GENERATOR" CONTENT="OpenOffice.org 3.4 (Win32)">
    <META NAME="AUTHOR" CONTENT="Тестовый менеджер">
    <META NAME="CREATED" CONTENT="20120525;11540000">
    <META NAME="CHANGEDBY" CONTENT="Тестовый менеджер">

Here you can see that the charset is specified above all the rest. If we can assume that as the usual behaviour, then we can even just skip a few bytes from the beginning and get directly to the '=utf-8"' part :-)

--
View this message in context: http://xwiki.475771.n2.nabble.com/which-HTML-parsing-libs-are-already-using-shiipped-with-XWiki-tp7580136p7580140.html
Sent from the XWiki- Users mailing list archive at Nabble.com.
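For whoever picks this up, here is a rough sketch of the "peek at the first bytes" idea from the message above. It rests on two assumptions of mine: the META charset declaration sits within the first 1024 bytes of the export, and the file is in an ASCII-compatible encoding, so the tags are readable when decoded as Latin1. The class name MetaCharsetSniffer, its two methods, and the 1024-byte limit are made up for illustration. It uses only the standard JDK and has not been tried inside XWiki.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: guesses the charset from a META tag near the start of an HTML stream. */
final class MetaCharsetSniffer {

    /** Assumption: the charset declaration appears within this many bytes. */
    private static final int SNIFF_LIMIT = 1024;

    /** Matches charset=utf-8, charset="utf-8" or charset='utf-8', ignoring case. */
    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    private MetaCharsetSniffer() {
    }

    /** Looks for a charset declaration in the first {@code length} bytes of {@code head}. */
    static Charset sniff(byte[] head, int length, Charset fallback) {
        // Latin1 maps every byte to exactly one char, so decoding cannot fail
        // and the ASCII tags stay readable whatever the real encoding is.
        String text = new String(head, 0, length, StandardCharsets.ISO_8859_1);
        Matcher matcher = CHARSET.matcher(text);
        if (matcher.find()) {
            try {
                return Charset.forName(matcher.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: use the fallback below.
            }
        }
        return fallback;
    }

    /** Wraps the stream in a Reader that uses the sniffed charset, or the fallback. */
    static Reader open(InputStream in, Charset fallback) throws IOException {
        BufferedInputStream buffered = new BufferedInputStream(in, SNIFF_LIMIT);
        buffered.mark(SNIFF_LIMIT);
        byte[] head = new byte[SNIFF_LIMIT];
        int length = 0;
        // Read until the limit or end of stream, whichever comes first.
        while (length < SNIFF_LIMIT) {
            int read = buffered.read(head, length, SNIFF_LIMIT - length);
            if (read < 0) {
                break;
            }
            length += read;
        }
        // Rewind so the real parser sees the document from its first byte.
        buffered.reset();
        return new InputStreamReader(buffered, sniff(head, length, fallback));
    }
}
```

The hardcoded line could then become something like `htmlReader = MetaCharsetSniffer.open(htmlStream, StandardCharsets.UTF_8);`, but that is only a guess at how it would fit and not a reviewed patch. If the META tag is missing or names a charset the JVM does not know, the fallback is used, so the current UTF-8 behaviour is kept in that case.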
