[Libreoffice-bugs] [Bug 76021] FORMATTING: Libre Office Writer: save As HTML results in interlaced strike and span tags
https://bugs.documentfoundation.org/show_bug.cgi?id=76021 --- Comment #16 from Rev. Bob b...@thehandbasket.com ---
(In reply to Tomaz Vajngerl from comment #5)
> Heh - it's even a bigger mess when you add bold, italics and underline into the mix.

Something tells me this is related to the behavior I describe in bug 89069, especially where bold and italic are treated differently than the other inline formatting options. I was specifically looking at start-of-line behavior, but there may well be more to it...

-- You are receiving this mail because: You are the assignee for the bug. ___ Libreoffice-bugs mailing list Libreoffice-bugs@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/libreoffice-bugs
[Libreoffice-bugs] [Bug 76021] FORMATTING: Libre Office Writer: save As HTML results in interlaced strike and span tags
https://bugs.freedesktop.org/show_bug.cgi?id=76021 --- Comment #15 from Julien Nabet serval2...@yahoo.fr --- Patrick: Oops, you're right of course! :-)
[Libreoffice-bugs] [Bug 76021] FORMATTING: Libre Office Writer: save As HTML results in interlaced strike and span tags
https://bugs.freedesktop.org/show_bug.cgi?id=76021 --- Comment #12 from Patrick Goetz pgo...@mail.utexas.edu --- Intellectual curiosity leads me to add that I'd love for the person who wrote the Export to XHTML code to explain why they went with a purely CSS class-based approach, especially since the Google Docs people (who I know have plenty of resources) did the same thing.
[Libreoffice-bugs] [Bug 76021] FORMATTING: Libre Office Writer: save As HTML results in interlaced strike and span tags
https://bugs.freedesktop.org/show_bug.cgi?id=76021 --- Comment #13 from Julien Nabet serval2...@yahoo.fr ---
(In reply to comment #12)
> Intellectual curiosity leads me to add that I'd love for the person who wrote the Export to XHTML code to explain why they went with a purely CSS class-based approach, especially since the Google Docs people (who I know have plenty of resources) did the same thing.

Patrick: if it's ooo2wordml_text.xsl that does the job, it might be explained like this: when we look at the history of this file (see http://opengrok.libreoffice.org/history/core/filter/source/xslt/export/wordml/ooo2wordml_text.xsl), we can see it was created in 2004 and, if you leave aside the license changes, the last change was in March 2005 (9 years ago!).
[Libreoffice-bugs] [Bug 76021] FORMATTING: Libre Office Writer: save As HTML results in interlaced strike and span tags
https://bugs.freedesktop.org/show_bug.cgi?id=76021 --- Comment #14 from Patrick Goetz pgo...@mail.utexas.edu --- ooo2wordml_text.xsl sounds like an XSL script which converts ODF to OOXML -- surely this wouldn't be the same XSL used to export to XHTML?
[Libreoffice-bugs] [Bug 76021] FORMATTING: Libre Office Writer: save As HTML results in interlaced strike and span tags
https://bugs.freedesktop.org/show_bug.cgi?id=76021 --- Comment #10 from Patrick Goetz pgo...@mail.utexas.edu ---
Created attachment 95845
  --> https://bugs.freedesktop.org/attachment.cgi?id=95845&action=edit
.docx file used for the Export to XHTML example discussed in the comment.
[Libreoffice-bugs] [Bug 76021] FORMATTING: Libre Office Writer: save As HTML results in interlaced strike and span tags
https://bugs.freedesktop.org/show_bug.cgi?id=76021 --- Comment #11 from Patrick Goetz pgo...@mail.utexas.edu ---
> If you want a valid XML document export it as XHTML, which is actually using XML as a base.

The problem with this is that the XHTML I get when I use Export to XHTML is, in my opinion, quite bizarre (however, similar to what you get with Publish to the Web using Google Docs). Using the attached .docx file as a starting point, this is what I get when I export to XHTML (snippet of file, line breaks added):

    <p class="P1"><span class="T1">Complainant</span>
    <span class="apple-converted-space"><span class="T2"> </span></span>
    <span class="T2">shall mean (a)</span>
    <span class="apple-converted-space"><span class="T2"> </span></span>
    <span class="T3">the</span>
    <span class="apple-converted-space"><span class="T2"> </span></span>
    <span class="T4">any</span>
    <span class="apple-converted-space"><span class="T2"> </span></span>
    <span class="T2">person or persons from whom the Intake Officer receives information concerning an Offense</span>
    <span class="apple-converted-space"><span class="T2"> </span></span>
    <span class="T4">and who, upon consent of that person(s), is designated a Complainant by the Intake Officer</span>
    <span class="apple-converted-space"><span class="T2"> </span></span>
    <span class="T2">or (b) any Injured Person designated by the Bishop Diocesan who in the Bishop Diocesan’s discretion, should be afforded the status of a Complainant, provided, however, that any Injured Person so designated may decline such designation.</span></p>

Ignoring that vim on the Windows XP machine I'm using is not reading the UTF-8 characters correctly, notice that common tags such as <b> and <i> are not used; the formatting is instead applied as classes on <span> tags. In this case, .T1 maps to a single CSS property:

    .T1 { font-weight:bold; }

In a longer version of the same document (i.e. including more text from the same original document) you get more complex classes:

    .T1 { font-size:10pt; font-weight:bold; }
    .T13 { font-style:italic; }
    .T14 { font-style:italic; }
    .T15 { font-style:italic; }
    .T16 { font-style:italic; text-decoration:underline; }
    .T17 { font-style:italic; text-decoration:underline; }
    .T18 { font-style:italic; }
    .T19 { font-style:italic; font-weight:bold; }
    .T20 { font-style:italic; font-weight:bold; }
    .T21 { font-style:italic; font-weight:bold; }
    .T22 { font-style:italic; font-weight:bold; }
    .T26 { padding:0in; border-style:none; }
    .T27 { text-decoration:underline; }
    .T28 { text-decoration:underline; padding:0in; border-style:none; }
    .T29 { font-style:italic; text-decoration:underline; }

This is both unreadable and hard to parse. Moreover, if I take exactly the same document and add some text, then all these classes change! Also note the strange duplication of classes that do exactly the same thing (.T13, .T14, .T15, .T18).

In my application, what I need to do is extract the text, preserving simple formatting such as <p>, <b>, <i>, and (deprecated) <strike>, in order to paste this content into another XML document. This is doable using the exported XHTML, but extremely onerous, since it will require at least 2 passes through a parser: first to add the simple XHTML tags I want (<b>, <i>) that weren't included in the first place, then another pass to strip out all the remaining classes and other XHTML coding that I don't want.

I can't fathom why KISS isn't being applied here: use basic XHTML tags whenever possible in order to keep the output readable and sane. I've written a fair amount of XML parsing code myself, so I do know something about it. I can't help but think this is an example of incredibly lazy programming (unless I'm missing something).
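For illustration, the clean-up described in comment #11 (turn class-based spans back into simple tags, then drop everything else) can be sketched in a single recursive pass. This is only a sketch under stated assumptions, not what any LibreOffice filter does: the sample CSS and XHTML strings are shortened stand-ins modelled on the snippet above, the helper names class_tags, render and simplify are hypothetical, and only bold and italic are mapped.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Shortened, hypothetical inputs modelled on the exported snippet and stylesheet.
CSS = (".T1 { font-weight:bold; } .T13 { font-style:italic; } "
       ".T19 { font-style:italic; font-weight:bold; }")
XHTML = ('<p class="P1"><span class="T1">Complainant</span>'
         '<span class="T13"> shall mean</span>'
         '<span class="T19"> any person</span></p>')

# CSS declarations that have a simple tag equivalent (assumption: only these two).
SIMPLE = {("font-weight", "bold"): "b", ("font-style", "italic"): "i"}
# Elements that are copied through; every other element is unwrapped.
KEEP = {"p", "b", "i", "strike"}


def class_tags(css):
    """Map each class name to the simple tags implied by its declarations."""
    mapping = {}
    for name, body in re.findall(r"\.(\w+)\s*\{([^}]*)\}", css):
        decls = [d.split(":", 1) for d in body.split(";") if ":" in d]
        tags = [SIMPLE.get((k.strip(), v.strip())) for k, v in decls]
        mapping[name] = [t for t in tags if t]
    return mapping


def render(elem, mapping):
    """Serialize elem using only simple tags, dropping classes and attributes."""
    tag = elem.tag.rsplit("}", 1)[-1]          # ignore an XHTML namespace prefix
    if tag == "span":
        tags = mapping.get(elem.get("class", ""), [])
    else:
        tags = [tag] if tag in KEEP else []
    inner = escape(elem.text or "")
    for child in elem:
        inner += render(child, mapping) + escape(child.tail or "")
    for t in reversed(tags):                   # first tag ends up outermost
        inner = f"<{t}>{inner}</{t}>"
    return inner


def simplify(xhtml, css):
    return render(ET.fromstring(xhtml), class_tags(css))


if __name__ == "__main__":
    print(simplify(XHTML, CSS))
```

Because the class names change whenever the document changes, the mapping is rebuilt from the stylesheet on every run rather than hard-coded.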
[Libreoffice-bugs] [Bug 76021] FORMATTING: Libre Office Writer: save As HTML results in interlaced strike and span tags
https://bugs.freedesktop.org/show_bug.cgi?id=76021 --- Comment #8 from Tomaz Vajngerl qui...@gmail.com --- I agree that HTML export in LO is really bad; it hasn't been worked on since Netscape was king, and it probably needs rewriting to make better use of CSS and SVG, to stop using deprecated HTML features, and to use new HTML5 tags where appropriate (with an easy choice between HTML4 and HTML5). This will probably take some time. However, if you are trying to parse HTML with an XML parser then it is your own fault. HTML is not XML - there are subtle differences: tags are case sensitive in XML but not in HTML, there is no need for / if an element has no body (for example, <br> is valid HTML but not XML), and improperly nested tags are tolerated in HTML. In other words: it is recommended today to write HTML as XML, but it is not mandated, so you cannot rely on that. If you want a valid XML document export it as XHTML, which is actually using XML as a base.
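The differences listed in comment #8 are easy to see by feeding the same markup to an XML parser and to an HTML parser. The following is a small sketch using only the Python standard library; the sample string and the helper names xml_accepts, TagLogger and html_events are made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Mixed-case tags and a <br> without a closing slash: fine as HTML, not as XML.
SAMPLE = "<P>first line<br>second line</p>"


def xml_accepts(text):
    """Return True if text parses as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


class TagLogger(HTMLParser):
    """Record start and end tags as the tolerant HTML parser reports them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def html_events(text):
    parser = TagLogger()
    parser.feed(text)
    parser.close()
    return parser.events


if __name__ == "__main__":
    print(xml_accepts(SAMPLE))     # the XML parser rejects the sample
    print(html_events(SAMPLE))     # the HTML parser reports lower-cased tags
```

The XML parser rejects the sample because of the case mismatch and the unclosed br element, while html.parser simply reports the tags in lower case.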
[Libreoffice-bugs] [Bug 76021] FORMATTING: Libre Office Writer: save As HTML results in interlaced strike and span tags
https://bugs.freedesktop.org/show_bug.cgi?id=76021 --- Comment #9 from Tomaz Vajngerl qui...@gmail.com ---
(In reply to comment #7)
> I wonder if export-xhtml and save as-html call the same code. I think I read in a bug that they could be 2 different parts (one uses an XSLT file). Miklos: any idea?

Yes, export-xhtml is using XSLT and they aren't using the same code paths.
[Libreoffice-bugs] [Bug 76021] FORMATTING: Libre Office Writer: save As HTML results in interlaced strike and span tags
https://bugs.freedesktop.org/show_bug.cgi?id=76021 --- Comment #5 from Tomaz Vajngerl qui...@gmail.com --- Heh - it's even a bigger mess when you add bold, italics and underline into the mix.
[Libreoffice-bugs] [Bug 76021] FORMATTING: Libre Office Writer: save As HTML results in interlaced strike and span tags
https://bugs.freedesktop.org/show_bug.cgi?id=76021 --- Comment #6 from Patrick Goetz pgo...@mail.utexas.edu --- I've been doing this -- in particular, coding, and working with XML/HTML -- for a long time. This smells of horrifically bad coding that probably needs to be rewritten from scratch. No sensible XML parser would start with valid XML and end up with invalid HTML -- that doesn't make sense.
[Libreoffice-bugs] [Bug 76021] FORMATTING: Libre Office Writer: save As HTML results in interlaced strike and span tags
https://bugs.freedesktop.org/show_bug.cgi?id=76021

Julien Nabet serval2...@yahoo.fr changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |vmik...@collabora.co.uk

--- Comment #7 from Julien Nabet serval2...@yahoo.fr ---
I wonder if export-xhtml and save as-html call the same code. I think I read in a bug that they could be 2 different parts (one uses an XSLT file). Miklos: any idea?
[Libreoffice-bugs] [Bug 76021] FORMATTING: Libre Office Writer: save As HTML results in interlaced strike and span tags
https://bugs.freedesktop.org/show_bug.cgi?id=76021 --- Comment #1 from Urmas davian...@gmail.com --- HTML is not XML and therefore doesn't require nested tags or XML document structure.
[Libreoffice-bugs] [Bug 76021] FORMATTING: Libre Office Writer: save As HTML results in interlaced strike and span tags
https://bugs.freedesktop.org/show_bug.cgi?id=76021 --- Comment #2 from Patrick Goetz pgo...@mail.utexas.edu ---
> HTML is not XML and therefore doesn't require nested tags or XML document structure.

While this might very well have been true in 1998, all modern versions of HTML are also valid XML with DTDs and Doctypes. In any case, users expect to get valid output, and often the reason someone is doing Save as HTML in the first place is that the document is going to be parsed. It makes no sense to start out with a document that must be valid XML and end up with invalid HTML.

This is quite embarrassing. I've been recommending that people upgrade to Libre Office from MS Office, but in this case at least Microsoft is putting out valid HTML. I don't understand what happened; I don't recall seeing this with previous versions of Open Office.
[Libreoffice-bugs] [Bug 76021] FORMATTING: Libre Office Writer: save As HTML results in interlaced strike and span tags
https://bugs.freedesktop.org/show_bug.cgi?id=76021 --- Comment #3 from Patrick Goetz pgo...@mail.utexas.edu --- I checked Google Docs as well, converting the same document to HTML and checking to see if the tag structure is XML-valid. While the HTML output from Google Docs can best be described as bizarre (every possible text formatting is set up as a class and applied using <span class="...">), the file is nevertheless valid XML.
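A check of the tag structure like the one described in comment #3 can be approximated with a stack: every closing tag should match the most recently opened element, and an interlaced strike/span pair shows up as a mismatch. The sketch below is an assumption-laden illustration, not the check actually used in this report: the class name OverlapFinder, the helper find_overlaps and the two sample strings are hypothetical, and HTML's optional end tags (for example on li) are not modelled, so such markup would be reported as well.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag and so are never pushed on the stack.
VOID = {"br", "hr", "img", "meta", "link", "input"}

# Hypothetical samples: the first interlaces strike and span, the second nests them.
INTERLACED = "<p><strike><span>old text</strike></span></p>"
NESTED = "<p><span><strike>old text</strike></span></p>"


class OverlapFinder(HTMLParser):
    """Collect (closing tag, innermost open tag) pairs that do not match."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
            return
        # Mismatch: record it, then drop the nearest open element of that name.
        self.problems.append((tag, self.stack[-1] if self.stack else None))
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i] == tag:
                del self.stack[i]
                break


def find_overlaps(markup):
    finder = OverlapFinder()
    finder.feed(markup)
    finder.close()
    return finder.problems


if __name__ == "__main__":
    print(find_overlaps(INTERLACED))
    print(find_overlaps(NESTED))
```

For the interlaced sample the finder reports that strike was closed while span was still the innermost open element; the properly nested sample produces an empty list.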
[Libreoffice-bugs] [Bug 76021] FORMATTING: Libre Office Writer: save As HTML results in interlaced strike and span tags
https://bugs.freedesktop.org/show_bug.cgi?id=76021

Julien Nabet serval2...@yahoo.fr changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
                 CC|                            |serval2...@yahoo.fr
     Ever confirmed|0                           |1

--- Comment #4 from Julien Nabet serval2...@yahoo.fr ---
On pc Debian x86-64 with master sources updated today, I can reproduce this.