Re: [xwiki-users] which HTML parsing libs are already using/shiipped with XWiki ?

Arioch Wed, 04 Jul 2012 04:13:42 -0700

Okay, maybe you'd better write the code?

I can only copy-paste from googled sources, without real Java knowledge or
any real ability to test.

So even if I do something, it would still have to be reviewed, and it might
not even compile.

http://www.benmccann.com/dev-blog/java-html-parsing-library-comparison/
Here I can see how to create a DOM, yet that would be overkill; SAX is
probably the better approach here.
But can SAX be run over HTML rather than XML?

java-sources.net suggests using hotsax.sf.net, but it probably lacks
charset auto-detection.
Another HTML SAX parser is JTagSoup; it also lacks auto-detection, yet it
suggests looking at jchardet.sourceforge.net

As far as I can see, OpenOffice does not offer UTF-16 or similar exports, so
we have to choose between UTF-8, UTF-7 and single-byte encodings...

That should replace the hardcoded
    htmlReader = new InputStreamReader(htmlStream, "UTF-8");
at
https://github.com/xwiki/xwiki-platform/blob/master/xwiki-platform-core/xwiki-platform-office/xwiki-platform-office-importer/src/main/java/org/xwiki/officeimporter/internal/builder/DefaultXHTMLOfficeDocumentBuilder.java
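
A rough sketch of what that replacement could look like, assuming the charset
name has already been detected somehow. HtmlReaderFactory, openReader and
detectedCharset are made-up names for illustration, not anything from
XWiki, and I have not tried this against the real builder. It only uses
plain JDK classes and falls back to UTF-8 when the name is missing or
unknown:

```java
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of XWiki.
final class HtmlReaderFactory {

    // Opens a reader over the HTML stream using the detected charset name.
    // Falls back to UTF-8 (the current hardcoded behaviour) when the name
    // is null, malformed, or not supported by this JVM.
    static Reader openReader(InputStream htmlStream, String detectedCharset) {
        Charset charset = StandardCharsets.UTF_8;
        if (detectedCharset != null) {
            try {
                charset = Charset.forName(detectedCharset.trim());
            } catch (IllegalArgumentException e) {
                // IllegalCharsetNameException and UnsupportedCharsetException
                // both extend IllegalArgumentException: keep the UTF-8 default.
            }
        }
        return new InputStreamReader(htmlStream, charset);
    }
}
```

The call site would then become something like
htmlReader = HtmlReaderFactory.openReader(htmlStream, detectedCharset);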

We can maybe assume any (ASCII-compatible) charset initially, since we need
only the Latin1 tags and values.
Yet... some tag attribute values might be non-Latin, and if the tag order
were different, they might come before the encoding tag...

For example:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
        <META HTTP-EQUIV="CONTENT-TYPE" CONTENT="text/html; charset=utf-8">
        <TITLE></TITLE>
        <META NAME="GENERATOR" CONTENT="OpenOffice.org 3.4  (Win32)">
        <META NAME="AUTHOR" CONTENT="Тестовый менеджер">
        <META NAME="CREATED" CONTENT="20120525;11540000">
        <META NAME="CHANGEDBY" CONTENT="Тестовый менеджер">
        

Here you can see that the charset is specified above all the rest.
If we can assume that as the usual behaviour, then we can even just
skip a few bytes from the beginning and get directly to the '=utf-8"' part :-)
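
A minimal sketch of that idea, with one change: instead of a fixed byte
offset it searches the first chunk of the file for "charset=", which should
survive small differences in the header. The bytes are decoded as ISO-8859-1
only because that maps every byte to exactly one char, so decoding cannot
fail and the ASCII tag text stays readable whatever the real single-byte or
UTF-8 encoding is. MetaCharsetSniffer and sniff are made-up names, and the
regex is my guess at what is good enough for OpenOffice output, not a full
HTML charset detector:

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of XWiki.
final class MetaCharsetSniffer {

    // Finds charset=NAME (optionally quoted), ignoring case.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns the declared charset name found in the first bytes of the
    // document, or null when no declaration is present.
    static String sniff(byte[] head) {
        // One byte becomes one char; non-ASCII bytes turn into junk chars,
        // but the ASCII "charset=..." text is left intact.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = CHARSET.matcher(text);
        return m.find() ? m.group(1) : null;
    }
}
```

The caller would have to read, say, the first kilobyte into a byte array,
call sniff on it, and then rewind the stream (for instance by wrapping it in
a BufferedInputStream and using mark/reset) before building the real reader.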




--
View this message in context: 
http://xwiki.475771.n2.nabble.com/which-HTML-parsing-libs-are-already-using-shiipped-with-XWiki-tp7580136p7580140.html
Sent from the XWiki- Users mailing list archive at Nabble.com.
_______________________________________________
users mailing list
[email protected]
http://lists.xwiki.org/mailman/listinfo/users
