[Libreoffice-bugs] [Bug 76021] FORMATTING: Libre Office Writer: save As HTML results in interlaced strike and span tags

2015-04-19 Thread bugzilla-daemon
https://bugs.documentfoundation.org/show_bug.cgi?id=76021

--- Comment #16 from Rev. Bob b...@thehandbasket.com ---
(In reply to Tomaz Vajngerl from comment #5)
> Heh - it's even a bigger mess when you add bold, italics and underline into
> the mix.

Something tells me this is related to the behavior I describe in bug 89069,
especially where bold and italic are treated differently than the other inline
formatting options. I was specifically looking at start-of-line behavior, but
there may well be more to it...

-- 
You are receiving this mail because:
You are the assignee for the bug.
___
Libreoffice-bugs mailing list
Libreoffice-bugs@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/libreoffice-bugs


[Libreoffice-bugs] [Bug 76021] FORMATTING: Libre Office Writer: save As HTML results in interlaced strike and span tags

2014-03-18 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=76021

--- Comment #15 from Julien Nabet serval2...@yahoo.fr ---
Patrick: Oops, you're right of course! :-)



[Libreoffice-bugs] [Bug 76021] FORMATTING: Libre Office Writer: save As HTML results in interlaced strike and span tags

2014-03-17 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=76021

--- Comment #12 from Patrick Goetz pgo...@mail.utexas.edu ---
Intellectual curiosity leads me to add that I'd love for the person who wrote
the Export to xhtml code to explain why they went with a purely CSS
class-based approach; especially since the Google Docs people (who I know have
plenty of resources) did the same thing.



[Libreoffice-bugs] [Bug 76021] FORMATTING: Libre Office Writer: save As HTML results in interlaced strike and span tags

2014-03-17 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=76021

--- Comment #13 from Julien Nabet serval2...@yahoo.fr ---
(In reply to comment #12)
> Intellectual curiosity leads me to add that I'd love for the person who
> wrote the Export to xhtml code to explain why they went with a purely CSS
> class-based approach; especially since the Google Docs people (who I know
> have plenty of resources) did the same thing.

Patrick: if it's ooo2wordml_text.xsl which does the job, it might be explained
like this:
when we look at the history of this file (see
http://opengrok.libreoffice.org/history/core/filter/source/xslt/export/wordml/ooo2wordml_text.xsl),
we can see it was created in 2004 and, if you leave aside the license changes,
the last change was in March 2005 (9 years ago!).



[Libreoffice-bugs] [Bug 76021] FORMATTING: Libre Office Writer: save As HTML results in interlaced strike and span tags

2014-03-17 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=76021

--- Comment #14 from Patrick Goetz pgo...@mail.utexas.edu ---
ooo2wordml_text.xsl sounds like an XSL script which converts ODF to OOXML --
surely this wouldn't be the same XSL used to export to xhtml?



[Libreoffice-bugs] [Bug 76021] FORMATTING: Libre Office Writer: save As HTML results in interlaced strike and span tags

2014-03-15 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=76021

--- Comment #10 from Patrick Goetz pgo...@mail.utexas.edu ---
Created attachment 95845
  --> https://bugs.freedesktop.org/attachment.cgi?id=95845&action=edit
.docx file used for Export to xhtml example discussed in the comment.



[Libreoffice-bugs] [Bug 76021] FORMATTING: Libre Office Writer: save As HTML results in interlaced strike and span tags

2014-03-15 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=76021

--- Comment #11 from Patrick Goetz pgo...@mail.utexas.edu ---
> If you want a valid XML document export it as XHTML, which is actually using
> XML as a base.

The problem with this is that the xhtml I get when I use Export to xhtml is,
in my opinion, quite bizarre (however, similar to what you get with Publish to
the Web using Google Docs).  Using the attached .docx file as a starting
point, this is what I get when I export to xhtml (snippet of file):

<p class="P1"><span class="T1">Complainant</span><span
class="apple-converted-space"><span class="T2"> </span></span><span
class="T2">shall mean (a)</span><span class="apple-converted-space"><span
class="T2"> </span></span><span class="T3">the</span><span
class="apple-converted-space"><span class="T2"> </span></span><span
class="T4">any</span><span class="apple-converted-space"><span class="T2">Â
</span></span><span class="T2">person or persons from whom the Intake Officer
receives information concerning an Offense</span><span
class="apple-converted-space"><span class="T2"> </span></span><span
class="T4">and who, upon consent of that person(s), is designated a Complainant
by the Intake Officer</span><span class="apple-converted-space"><span
class="T2"> </span></span><span class="T2">or (b) any Injured Person
designated by the Bishop Diocesan who in the Bishop Diocesan’s discretion,
should be afforded the status of a Complainant, provided, however, that any
Injured Person so designated may decline such designation.</span></p>

(Ignoring that vim on the Windows XP machine I'm using is not reading the UTF-8
characters correctly), notice that common tags such as <b> and <i> are being
inserted as classes using the <span> tag.  In this case, .T1 maps to a single
CSS property:
.T1 { font-weight:bold; }

In a longer version of the same document (i.e. including more text from the
same original document) you get more complex classes:
.T1 { font-size:10pt; font-weight:bold; }
.T13 { font-style:italic; }
.T14 { font-style:italic; }
.T15 { font-style:italic; }
.T16 { font-style:italic; text-decoration:underline; }
.T17 { font-style:italic; text-decoration:underline; }
.T18 { font-style:italic; }
.T19 { font-style:italic; font-weight:bold; }
.T20 { font-style:italic; font-weight:bold; }
.T21 { font-style:italic; font-weight:bold; }
.T22 { font-style:italic; font-weight:bold; }
.T26 { padding:0in; border-style:none; }
.T27 { text-decoration:underline; }
.T28 { text-decoration:underline; padding:0in; border-style:none; }
.T29 { font-style:italic; text-decoration:underline; }

This is both unreadable and hard to parse.  Moreover, if I take exactly the
same document and add some text, then all these classes change!  Also note the
strange duplication of classes that do exactly the same thing
(.T13, .T14, .T15, .T18).
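
As a rough illustration of that duplication, here is a minimal Python sketch
that groups the class rules listed above by their declarations.  The function
name group_duplicate_classes and the regex-based rule parsing are assumptions
made for illustration; this is not a real CSS parser and it only handles flat
".name { ... }" rules like the ones shown.

```python
import re
from collections import defaultdict

# The class rules from the longer export, copied from the listing above.
CSS = """
.T1 { font-size:10pt; font-weight:bold; }
.T13 { font-style:italic; }
.T14 { font-style:italic; }
.T15 { font-style:italic; }
.T16 { font-style:italic; text-decoration:underline; }
.T17 { font-style:italic; text-decoration:underline; }
.T18 { font-style:italic; }
.T19 { font-style:italic; font-weight:bold; }
.T20 { font-style:italic; font-weight:bold; }
.T21 { font-style:italic; font-weight:bold; }
.T22 { font-style:italic; font-weight:bold; }
.T26 { padding:0in; border-style:none; }
.T27 { text-decoration:underline; }
.T28 { text-decoration:underline; padding:0in; border-style:none; }
.T29 { font-style:italic; text-decoration:underline; }
"""


def group_duplicate_classes(css):
    """Return lists of class names whose declarations are identical."""
    groups = defaultdict(list)
    for name, body in re.findall(r"\.([\w-]+)\s*\{([^}]*)\}", css):
        # Order of declarations does not matter, so compare them as a set.
        decls = frozenset(d.strip() for d in body.split(";") if d.strip())
        groups[decls].append(name)
    return [names for names in groups.values() if len(names) > 1]


for names in group_duplicate_classes(CSS):
    print(names)
```

On the listing above this reports three groups of interchangeable classes,
not just the italic-only one.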

In my application, what I need to do is extract the text, preserving simple
formatting such as <p>, <b>, <i>, and (deprecated) <strike> in order to paste
this content into another xml document.  This is doable using the exported
xhtml, but extremely onerous since, for example, it will require at least 2
passes through a parser: first to add the simple xhtml tags I want (<b>, <i>)
that weren't included in the first place, then another pass to strip out all
the remaining classes and other xhtml coding that I don't want.
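
For what it's worth, the two passes described above can probably be folded
into one tree walk.  What follows is a minimal Python sketch, assuming the
exported file is well-formed XML and that its stylesheet uses flat
".name { ... }" rules as in the listing above.  DECL_TO_TAG, parse_css and
simplify are hypothetical names, and the line-through mapping for <strike> is
an assumption, since no strike class appears in the listing.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Assumed mapping from CSS declarations to the simple tags wanted here.
DECL_TO_TAG = {
    ("font-weight", "bold"): "b",
    ("font-style", "italic"): "i",
    ("text-decoration", "line-through"): "strike",  # assumption
}


def parse_css(css):
    """Map each class name to the simple tags its declarations imply."""
    class_tags = {}
    for name, body in re.findall(r"\.([\w-]+)\s*\{([^}]*)\}", css):
        tags = []
        for decl in body.split(";"):
            if ":" not in decl:
                continue
            prop, value = (part.strip() for part in decl.split(":", 1))
            tag = DECL_TO_TAG.get((prop, value))
            if tag:
                tags.append(tag)
        class_tags[name] = tags
    return class_tags


def simplify(elem, class_tags):
    """Return simplified markup for one element and everything inside it."""
    inner = escape(elem.text or "")
    for child in elem:
        inner += simplify(child, class_tags) + escape(child.tail or "")
    local_name = elem.tag.rsplit("}", 1)[-1]  # drop an XML namespace, if any
    if local_name == "p":
        return f"<p>{inner}</p>"
    # Any other element is replaced by the tags its class implies (if any).
    for tag in reversed(class_tags.get(elem.get("class", ""), [])):
        inner = f"<{tag}>{inner}</{tag}>"
    return inner


css = ".T1 { font-weight:bold; }\n.T19 { font-style:italic; font-weight:bold; }"
xhtml = ('<p class="P1"><span class="T1">Complainant</span>'
         '<span class="T2"> shall mean (a)</span></p>')
print(simplify(ET.fromstring(xhtml), parse_css(css)))
```

The sketch ignores elements carrying several space-separated classes and
drops every attribute, which is what the use case above asks for but may not
suit other documents.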

I can't fathom why KISS isn't being applied here:  use basic xhtml tags
whenever possible in order to keep the output readable and sane. I've written a
fair amount of XML parsing code myself, so I do know something about it.  I
can't help but think this is an example of incredibly lazy programming (unless
I'm missing something).



[Libreoffice-bugs] [Bug 76021] FORMATTING: Libre Office Writer: save As HTML results in interlaced strike and span tags

2014-03-13 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=76021

--- Comment #8 from Tomaz Vajngerl qui...@gmail.com ---
I agree that HTML export in LO is really bad. It hasn't been worked on since
Netscape was king, and it probably needs rewriting to make better use of CSS
and SVG, to stop using deprecated HTML features, and to use new HTML5 tags
where appropriate (with an easy choice between HTML4 and HTML5). This will
probably take some time.

However, if you are trying to parse HTML with an XML parser then it is your own
fault. HTML is not XML - there are subtle differences: tags are case sensitive
in XML but not in HTML, there is no need for / if an element has no body (for
example: <br> is valid HTML but not XML), and overlapping tags are tolerated in
HTML. In other words: it is recommended today to write HTML as XML, but it is
not mandated, so you cannot rely on that.
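
The difference is easy to see with Python's standard library.  This is only a
minimal sketch: xml.etree.ElementTree rejects markup that is not well-formed
XML, while html.parser.HTMLParser simply reports tags as it encounters them.
The helper names and sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


def is_well_formed_xml(markup):
    """Return True if an XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TagLogger(HTMLParser):
    """Record start and end tags in the order the HTML parser sees them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def html_tag_events(markup):
    """Feed markup to the lenient HTML parser and return the tag events."""
    logger = TagLogger()
    logger.feed(markup)
    logger.close()
    return logger.events


samples = [
    "<p><strike><span>text</strike></span></p>",  # interlaced tags
    "<p>line<br></p>",                            # no closing slash
    "<P>text</p>",                                # mismatched case
    "<p>line<br/></p>",                           # fine as XML
]
for markup in samples:
    print(markup, is_well_formed_xml(markup), html_tag_events(markup))
```

Only the last sample passes the XML check; the HTML parser raises no error on
any of them, and it lowercases tag names, so <P> and </p> come out matching.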

If you want a valid XML document export it as XHTML, which is actually using
XML as a base.



[Libreoffice-bugs] [Bug 76021] FORMATTING: Libre Office Writer: save As HTML results in interlaced strike and span tags

2014-03-13 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=76021

--- Comment #9 from Tomaz Vajngerl qui...@gmail.com ---
(In reply to comment #7)
> I wonder if export-xhtml and save as-html calls the same part.
> I think having read in a bug that it could be 2 different parts (one uses
> xslt file)
>
> Miklos: any idea?

Yes, export-xhtml is using XSLT and they aren't using the same code paths.



[Libreoffice-bugs] [Bug 76021] FORMATTING: Libre Office Writer: save As HTML results in interlaced strike and span tags

2014-03-12 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=76021

--- Comment #5 from Tomaz Vajngerl qui...@gmail.com ---
Heh - it's even a bigger mess when you add bold, italics and underline into the
mix.



[Libreoffice-bugs] [Bug 76021] FORMATTING: Libre Office Writer: save As HTML results in interlaced strike and span tags

2014-03-12 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=76021

--- Comment #6 from Patrick Goetz pgo...@mail.utexas.edu ---
I've been doing this -- in particular, coding, and working with XML/HTML -- for
a long time.  This smells of horrifically bad coding that probably needs to be
rewritten from scratch.  No sensible XML parser would start with valid XML and
end up with invalid HTML -- that doesn't make sense.



[Libreoffice-bugs] [Bug 76021] FORMATTING: Libre Office Writer: save As HTML results in interlaced strike and span tags

2014-03-12 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=76021

Julien Nabet serval2...@yahoo.fr changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |vmik...@collabora.co.uk

--- Comment #7 from Julien Nabet serval2...@yahoo.fr ---
I wonder if export-xhtml and save as-html call the same code.
I think I read in a bug that they could be 2 different parts (one uses an xslt
file).

Miklos: any idea?



[Libreoffice-bugs] [Bug 76021] FORMATTING: Libre Office Writer: save As HTML results in interlaced strike and span tags

2014-03-11 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=76021

--- Comment #1 from Urmas davian...@gmail.com ---
HTML is not XML and therefore doesn't require nested tags or XML document
structure.



[Libreoffice-bugs] [Bug 76021] FORMATTING: Libre Office Writer: save As HTML results in interlaced strike and span tags

2014-03-11 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=76021

--- Comment #2 from Patrick Goetz pgo...@mail.utexas.edu ---
> HTML is not XML and therefore doesn't require nested tags or XML document
> structure.

While this might very well have been true in 1998, all modern versions of HTML
are also valid XML with DTDs and Doctypes.  In any case, users expect to get
valid output, and often the reason someone is doing Save as HTML in the first
place is that the document is going to be parsed.  It makes no sense to start
out with a document that must be valid xml and end up with invalid HTML.

This is quite embarrassing.  I've been recommending that people upgrade to
Libre Office from MS Office, but in this case at least Microsoft is putting out
valid HTML.  I don't understand what happened; I don't recall seeing this with
previous versions of Open Office.



[Libreoffice-bugs] [Bug 76021] FORMATTING: Libre Office Writer: save As HTML results in interlaced strike and span tags

2014-03-11 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=76021

--- Comment #3 from Patrick Goetz pgo...@mail.utexas.edu ---
I checked Google Docs as well, converting the same document to HTML and
checking to see if the tag structure is xml-valid.  While the HTML output from
Google Docs can best be described as bizarre (every possible text formatting is
set up as a class and applied using <span class=>), the file is nevertheless
valid xml.



[Libreoffice-bugs] [Bug 76021] FORMATTING: Libre Office Writer: save As HTML results in interlaced strike and span tags

2014-03-11 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=76021

Julien Nabet serval2...@yahoo.fr changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
                 CC|                            |serval2...@yahoo.fr
     Ever confirmed|0                           |1

--- Comment #4 from Julien Nabet serval2...@yahoo.fr ---
On pc Debian x86-64 with master sources updated today, I can reproduce this.
