Re: [docbook-apps] change default HTML encoding to UTF-8
Hi Leif,

Thanks for taking the time to look into this in more detail. I have some responses below that I think will clarify the situation.

Bob Stayton
Sagehill Enterprises
b...@sagehill.net

On 8/15/2017 6:44 AM, Leif Halvard Silli wrote:
> Hi Bob. Do the stylesheets output all of html 4, html 5, xhtml and xhtml5? Or did you conflate html 4 and html 5? See more below.

The DocBook distribution has these stylesheets:

html - outputs HTML 4
xhtml - outputs XHTML 1.0
xhtml-1_1 - outputs XHTML 1.1 (mainly used for EPUB 2)
xhtml5 - outputs polyglot HTML 5

There is no stylesheet that outputs HTML 5 that is not serialized as XML. Here is the description of polyglot HTML 5 from Wikipedia:

"Polyglot HTML is HTML that has been written to conform to both the HTML and XHTML specifications.[1] A polyglot document can therefore be parsed as either HTML (which is SGML-compatible) or XML, and will produce the same DOM structure either way. For example, in order for an HTML5 document to meet these criteria, the two requirements are that it must have an HTML5 doctype, and be written in well-formed XHTML.[2] The same document can then be served as either HTML or XHTML, depending on browser support and MIME type."

I named the directory "xhtml5" to indicate that the output is parsable as XML. Those stylesheets output the DOCTYPE declaration expected of HTML 5 and the XHTML namespace declaration expected of XHTML.

> On 14 Aug 2017, at 18:48, Bob Stayton wrote:
>> We have a bug report suggesting that the default output encoding for the DocBook html stylesheet be changed from ISO-8859-1 to UTF-8.
>
> I agree with this bug report. Why? Well, for one thing, you - here - talk about "html", and "html" today means "html 5". HTML 5.x recommends that documents are authored using UTF-8.

In the DocBook stylesheet directory name, "html" means HTML 4. The XHTML 5 stylesheet outputs UTF-8.

> Also, when I look at the link in the forwarded message (https://www.oxygenxml.com/forum/viewtopic.php?f=6=14812=43711#p43711), I note that the discussion thread talks about HTML 5. I am not able to see that HTML 4 is mentioned at all in that thread.

I think this is the source of the confusion. I missed the subject line that said "HTML 5". Since they mentioned iso-8859-1, I assumed they were talking about the "html" stylesheets, which are the original HTML 4 output. So they were trying to get HTML 5 output but were using the "html" stylesheet.

>> Note this only applies to the original HTML 4 output from the "html" directory.

Right.

> Are you saying that the stylesheet also outputs HTML 5? (Note that I ask about "HTML 5" and not about xhtml or xhtml5.)

The "xhtml5" directory outputs polyglot HTML 5.

>> The "xhtml" and "xhtml5" outputs already use UTF-8.

Right.

> The justification for that ought to be that XML defaults to UTF-8. Xhtml and xhtml5 are not 'html'.

Well, I would say the W3C muddied that pond when they created polyglot HTML 5.

>> The original HTML 4 standard said ISO-8859-1 was the default encoding, but that UTF-8 would be acceptable.
>
> I am not able to find such a statement in the HTML 4 specification. I looked at the one page version: https://www.w3.org/TR/html401/html40.txt

I found that statement on W3Schools: https://www.w3schools.com/html/html_charset.asp

> UTF-8 "took over" as the dominant encoding on the Web long before HTML 5 became the official version of HTML.

Yes, no argument there.

> Technically speaking ISO-8859-1 is STILL the default HTML encoding, from user agents’ perspective. It is only from an authoring perspective that HTML 5 recommends UTF-8. The DocBook stylesheets are an authoring tool. There is only one processing model for HTML, and that model is defined by the latest HTML spec. Thus it should use UTF-8. At the very least, the DocBook stylesheet should not use the HTML 4 specification as a justification for failing to output HTML 5 as UTF-8.

It does not. If a user wants HTML 5 they will need to use the "xhtml5" stylesheets in the distribution, and they will get UTF-8.

>> It isn't difficult for a user to change the output to UTF-8, but it does require a customization. The question here is whether to change the default output encoding to UTF-8.
>
> If the user has to change the output to UTF-8 in order to produce HTML 5 output, then the stylesheet does not follow HTML5’s recommendations.

No, this user should have selected the "xhtml5" stylesheet if they want HTML 5 output. No amount of customization will get the "html" stylesheet to output HTML 5.

The DocBook XSL development process takes great pains to maintain backwards compatibility with its installed base. The reason the "html" directory still outputs HTML 4 is for backwards compatibility. Users who have built systems that use those stylesheets won't be surprised by suddenly getting HTML 5 output. If they want HTML 5 output, they should use the "xhtml5" directory.

I hope this
Re: [docbook-apps] change default HTML encoding to UTF-8
Hi Bob.

Do the stylesheets output all of html 4, html 5, xhtml and xhtml5? Or did you conflate html 4 and html 5? See more below.

On 14 Aug 2017, at 18:48, Bob Stayton wrote:
> We have a bug report suggesting that the default output encoding for the DocBook html stylesheet be changed from ISO-8859-1 to UTF-8.

I agree with this bug report. Why? Well, for one thing, you - here - talk about "html", and "html" today means "html 5". HTML 5.x recommends that documents are authored using UTF-8.

Also, when I look at the link in the forwarded message (https://www.oxygenxml.com/forum/viewtopic.php?f=6=14812=43711#p43711), I note that the discussion thread talks about HTML 5. I am not able to see that HTML 4 is mentioned at all in that thread.

> Note this only applies to the original HTML 4 output from the "html" directory.

Are you saying that the stylesheet also outputs HTML 5? (Note that I ask about "HTML 5" and not about xhtml or xhtml5.)

> The "xhtml" and "xhtml5" outputs already use UTF-8.

The justification for that ought to be that XML defaults to UTF-8. Xhtml and xhtml5 are not 'html'.

> The original HTML 4 standard said ISO-8859-1 was the default encoding, but that UTF-8 would be acceptable.

I am not able to find such a statement in the HTML 4 specification. I looked at the one page version: https://www.w3.org/TR/html401/html40.txt

UTF-8 "took over" as the dominant encoding on the Web long before HTML 5 became the official version of HTML.

Technically speaking ISO-8859-1 is STILL the default HTML encoding, from user agents’ perspective. It is only from an authoring perspective that HTML 5 recommends UTF-8. The DocBook stylesheets are an authoring tool. There is only one processing model for HTML, and that model is defined by the latest HTML spec. Thus it should use UTF-8. At the very least, the DocBook stylesheet should not use the HTML 4 specification as a justification for failing to output HTML 5 as UTF-8.

> It isn't difficult for a user to change the output to UTF-8, but it does require a customization. The question here is whether to change the default output encoding to UTF-8.

If the user has to change the output to UTF-8 in order to produce HTML 5 output, then the stylesheet does not follow HTML5’s recommendations. The fact that the user can produce XHTML - and thus automatically get UTF-8 - does not alter the picture.

> This would change the HTML output to replace character references like
Re: [docbook-apps] change default HTML encoding to UTF-8
hi Bob,

The change wouldn't be a hardship for me as I postprocess the built html to use utf-8 encoding anyway. Yet I'm resistant to change. So whatever you think best is fine with me.

thanks for checking in,
--Tim

On Mon, Aug 14, 2017 at 12:48 PM, Bob Stayton wrote:
> We have a bug report suggesting that the default output encoding for the
> DocBook html stylesheet be changed from ISO-8859-1 to UTF-8. Note this
> only applies to the original HTML 4 output from the "html" directory. The
> "xhtml" and "xhtml5" outputs already output UTF.
>
> The original HTML 4 standard said ISO-8859-1 was the default encoding, but
> that UTF-8 would be acceptable. It isn't difficult for a user to change
> the output to UTF-8, but it does require a customization. The question
> here is whether to change the default output encoding to UTF-8.
>
> This would change the HTML output to replace character references like
Re: [docbook-apps] change default HTML encoding to UTF-8
No Bob, no change here, though I would benefit (on occasion) from utf-8

regards

On 14 August 2017 at 17:48, Bob Stayton wrote:
> We have a bug report suggesting that the default output encoding for the
> DocBook html stylesheet be changed from ISO-8859-1 to UTF-8. Note this only
> applies to the original HTML 4 output from the "html" directory. The "xhtml"
> and "xhtml5" outputs already output UTF.
>
> The original HTML 4 standard said ISO-8859-1 was the default encoding, but
> that UTF-8 would be acceptable. It isn't difficult for a user to change the
> output to UTF-8, but it does require a customization. The question here is
> whether to change the default output encoding to UTF-8.
>
> This would change the HTML output to replace character references like
[docbook-apps] change default HTML encoding to UTF-8
We have a bug report suggesting that the default output encoding for the DocBook html stylesheet be changed from ISO-8859-1 to UTF-8. Note this only applies to the original HTML 4 output from the "html" directory. The "xhtml" and "xhtml5" outputs already use UTF-8.

The original HTML 4 standard said ISO-8859-1 was the default encoding, but that UTF-8 would be acceptable. It isn't difficult for a user to change the output to UTF-8, but it does require a customization. The question here is whether to change the default output encoding to UTF-8.

This would change the HTML output to replace character references like
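
The customization mentioned in the message above can be quite small. The following is only a sketch of one way it might look for single-file (non-chunked) output from the "html" stylesheet, not the project's prescribed method. The import URL assumes the stock stylesheets are resolved through the usual docbook.sourceforge.net "current" address or a catalog entry for it, and the file name utf8-html.xsl used in the note below is hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch of a customization layer; save under any name (utf8-html.xsl is an assumption). -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

  <!-- Import the stock HTML 4 stylesheet. Adjust the href to a local
       install path if the URL is not resolved through an XML catalog. -->
  <xsl:import href="http://docbook.sourceforge.net/release/xsl/current/html/docbook.xsl"/>

  <!-- This declaration has higher import precedence than the one in the
       imported stylesheet, so the serializer is asked to write UTF-8. -->
  <xsl:output method="html" encoding="UTF-8" indent="no"/>

</xsl:stylesheet>
```

With xsltproc this could be run as `xsltproc utf8-html.xsl book.xml > book.html`, where book.xml stands in for the user's own document. For chunked output the import would point at chunk.xsl instead, and the encoding is normally controlled by setting the chunker.output.encoding parameter to UTF-8 rather than through xsl:output.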