Re: [docbook-apps] change default HTML encoding to UTF-8

2017-08-15 Thread Bob Stayton

Hi Leif,
Thanks for taking the time to look into this in more detail.  I have 
some responses below that I think will clarify the situation.


Bob Stayton
Sagehill Enterprises
b...@sagehill.net

On 8/15/2017 6:44 AM, Leif Halvard Silli wrote:
Hi Bob. Do the stylesheets output both html 4, html 5, xhtml and xhtml5? 
Or did you conflate html 4 and html 5? See more below.


The DocBook distribution has these stylesheets:

html - outputs HTML 4
xhtml - outputs XHTML 1.0
xhtml-1_1 - outputs XHTML 1.1 (mainly used for EPUB 2)
xhtml5 - outputs polyglot HTML 5

There is no stylesheet that outputs HTML 5 that is not serialized as 
XML.  Here is the description of polyglot HTML 5 from Wikipedia:


"Polyglot HTML is HTML that has been written to conform to both the HTML 
and XHTML specifications.[1] A polyglot document can therefore be parsed 
as either HTML (which is SGML-compatible) or XML, and will produce the 
same DOM structure either way. For example, in order for an HTML5 
document to meet these criteria, the two requirements are that it must 
have an HTML5 doctype, and be written in well-formed XHTML.[2] The same 
document can then be served as either HTML or XHTML, depending on 
browser support and MIME type."


I named the directory "xhtml5" to indicate that the output is parsable 
as XML.  Those stylesheets output the DOCTYPE declaration expected of 
HTML 5 and the XHTML namespace declaration expected of XHTML.
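The polyglot idea described above can be sketched in a few lines of Python. This is only an illustration, not what the stylesheets do: the sample markup is made up, and Python's `html.parser` is a lenient tokenizer rather than a conforming HTML 5 parser. It shows one document, with the HTML 5 doctype and the XHTML namespace, being read by both an XML parser and an HTML parser and yielding the same sequence of elements.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical polyglot sample: HTML 5 doctype plus XHTML namespace,
# written as well-formed XML (every element closed).
POLYGLOT = (
    '<!DOCTYPE html>\n'
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<head><meta charset="UTF-8"/><title>Demo</title></head>'
    '<body><p>Hello<br/>world</p></body>'
    '</html>'
)


class TagCollector(HTMLParser):
    """Record element names in the order the HTML tokenizer sees them."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Also reached for <br/> and <meta/> via the default handle_startendtag.
        self.tags.append(tag)


# Parse as XML and strip the namespace prefix from each element name.
xml_tags = [el.tag.split('}', 1)[-1] for el in ET.fromstring(POLYGLOT).iter()]

# Parse as HTML.
collector = TagCollector()
collector.feed(POLYGLOT)
html_tags = collector.tags

print(xml_tags)
print(html_tags)
```

If the markup were not well-formed (for example an unclosed `<br>`), the XML parse would raise an error while the HTML parse would still succeed, which is the difference the "xhtml5" name is meant to signal.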



On 14 Aug 2017, at 18:48, Bob Stayton wrote:

We have a bug report suggesting that the default output encoding for 
the DocBook html stylesheet be changed from ISO-8859-1 to UTF-8.


I agree with this bug report. Why? Well, for one thing, you - here - 
talk about "html", and "html" today means "html 5". HTML 5.x recommends 
that documents are authored using UTF-8.


In the DocBook stylesheet directory name, "html" means HTML 4.  The 
XHTML 5 stylesheet outputs UTF-8.


Also, when I look at the link in the forwarded message 
(https://www.oxygenxml.com/forum/viewtopic.php?f=6=14812=43711#p43711), 
I note that the discussion thread talks about HTML 5. I am not able to 
see that HTML 4 is mentioned at all in that thread.




I think this is the source of the confusion. I missed the subject line 
that said "HTML 5". Since they mentioned iso-8859-1, I assumed they were 
talking about the "html" stylesheets, which are the original HTML 4 
output. So they were trying to get HTML 5 output but were using the 
"html" stylesheet.


Note this only applies to the original HTML 4 output from the "html" 
directory.


Right.



Are you saying that the stylesheet also outputs HTML 5? (Note that I ask 
about "HTML 5" and not about xhtml or xhtml5.)


The "xhtml5" directory outputs polyglot HTML 5.




The "xhtml" and "xhtml5" outputs already output UTF-8.


Right.



The justification for that ought to be that XML defaults to UTF-8. Xhtml 
and xhtml5 are not 'html'.


Well, I would say the W3C muddied that pond when they created polyglot 
HTML 5.




The original HTML 4 standard said ISO-8859-1 was the default encoding, 
but that UTF-8 would be acceptable.


I am not able to find such a statement in the HTML 4 specification. I 
looked at the one page version: https://www.w3.org/TR/html401/html40.txt


I found that statement here on the W3Schools website (which is not a W3C site):

https://www.w3schools.com/html/html_charset.asp

UTF-8 "took over" as the dominant encoding on the Web long before HTML 5 
became the official version of HTML.


Yes, no argument there.

Technically speaking ISO-8859-1 is STILL the default HTML encoding, from 
user agents’ perspective. It is only from an authoring perspective that 
HTML 5 recommends UTF-8.


The DocBook stylesheets are an authoring tool. There is only one processing 
model for HTML, and that model is defined by the latest HTML spec. Thus 
they should use UTF-8.
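The gap between the authoring recommendation and the user-agent fallback can be seen with a small sketch. It assumes, as the message above states, a fallback of ISO-8859-1 when no encoding is declared; real browsers may apply a different, locale-dependent fallback. The sample word is arbitrary.

```python
# A page authored in UTF-8 but served without any encoding declaration.
text = "caf\u00e9"                 # "café"
utf8_bytes = text.encode("utf-8")  # the e-acute becomes two bytes: C3 A9

# What a reader sees if the bytes are interpreted with the assumed fallback.
as_latin1 = utf8_bytes.decode("iso-8859-1")

# What a reader sees when the encoding is declared and honored.
as_utf8 = utf8_bytes.decode("utf-8")

print(as_latin1)  # mojibake: two characters where one was written
print(as_utf8)
```

This is why UTF-8 output needs to carry an explicit charset declaration rather than rely on the default.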


At the very least, the DocBook stylesheet should not use the HTML 4 
specification as a justification for failing to output HTML 5 as UTF-8.


It does not.  If a user wants HTML 5 they will need to use the "xhtml5" 
stylesheets in the distribution, and they will get UTF-8.


It isn't difficult for a user to change the output to UTF-8, but it 
does require a customization.  The question here is whether to change 
the default output encoding to UTF-8.


If the user has to change the output to UTF-8 in order to produce HTML 5 
output, then the stylesheet does not follow HTML5’s recommendations.


No, this user should have selected the "xhtml5" stylesheet if they want 
HTML 5 output.  No amount of customization will get the "html" 
stylesheet to output HTML 5.


The DocBook XSL development process takes great pains to maintain 
backwards compatibility with its installed base.  The reason the "html" 
directory still outputs HTML 4 is for backwards compatibility.  Users 
that have built systems that use those stylesheets won't be surprised by 
suddenly getting HTML 5 output.  If they want HTML 5 output, they 
should use the "xhtml5" directory.


I hope this 

Re: [docbook-apps] change default HTML encoding to UTF-8

2017-08-15 Thread Leif Halvard Silli
Hi Bob. Do the stylesheets output both html 4, html 5, xhtml and xhtml5? 
Or did you conflate html 4 and html 5? See more below.


On 14 Aug 2017, at 18:48, Bob Stayton wrote:

We have a bug report suggesting that the default output encoding for 
the DocBook html stylesheet be changed from ISO-8859-1 to UTF-8.


I agree with this bug report. Why? Well, for one thing, you - here - 
talk about "html", and "html" today means "html 5". HTML 5.x recommends 
that documents are authored using UTF-8.


Also, when I look at the link in the forwarded message 
(https://www.oxygenxml.com/forum/viewtopic.php?f=6=14812=43711#p43711), 
I note that the discussion thread talks about HTML 5. I am not able to 
see that HTML 4 is mentioned at all in that thread.


Note this only applies to the original HTML 4 output from the "html" 
directory.



Are you saying that the stylesheet also outputs HTML 5? (Note that I ask 
about "HTML 5" and not about xhtml or xhtml5.)




The "xhtml" and "xhtml5" outputs already output UTF-8.



The justification for that ought to be that XML defaults to UTF-8. Xhtml 
and xhtml5 are not 'html'.



The original HTML 4 standard said ISO-8859-1 was the default encoding, 
but that UTF-8 would be acceptable.


I am not able to find such a statement in the HTML 4 specification. I 
looked at the one page version: https://www.w3.org/TR/html401/html40.txt


UTF-8 "took over" as the dominant encoding on the Web long before 
HTML 5 became the official version of HTML.


Technically speaking ISO-8859-1 is STILL the default HTML encoding, from 
user agents’ perspective. It is only from an authoring perspective 
that HTML 5 recommends UTF-8.


The DocBook stylesheets are an authoring tool. There is only one processing 
model for HTML, and that model is defined by the latest HTML spec. Thus 
they should use UTF-8.


At the very least, the DocBook stylesheet should not use the HTML 4 
specification as a justification for failing to output HTML 5 as UTF-8.


It isn't difficult for a user to change the output to UTF-8, but it 
does require a customization.  The question here is whether to change 
the default output encoding to UTF-8.


If the user has to change the output to UTF-8 in order to produce HTML 5 
output, then the stylesheet does not follow HTML5’s recommendations.


The fact that the user can produce XHTML - and thus automatically get 
UTF-8 - does not alter the picture.


This would change the HTML output to replace character references like 


Re: [docbook-apps] change default HTML encoding to UTF-8

2017-08-14 Thread Tim Arnold
hi Bob,
The change wouldn't be a hardship for me as I postprocess the built html to
use utf-8 encoding anyway.
Yet I'm resistant to change. So whatever you think best is fine with me.
thanks for checking in,
--Tim
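A post-processing step of the kind mentioned above might look roughly like the following. This is a guess at the general shape, not Tim's actual script: the function name is hypothetical, and it assumes the built HTML is ISO-8859-1 and labels itself with a `charset=ISO-8859-1` declaration.

```python
import re


def latin1_html_to_utf8(data: bytes) -> bytes:
    """Re-encode ISO-8859-1 HTML bytes as UTF-8 and update the charset label."""
    text = data.decode("iso-8859-1")
    # Rewrites every "charset=iso-8859-1" occurrence, including any that
    # happen to appear in body text; acceptable for a sketch.
    text = re.sub(r"(charset\s*=\s*)iso-8859-1", r"\g<1>UTF-8", text,
                  flags=re.IGNORECASE)
    return text.encode("utf-8")


# Made-up input resembling a generated page.
sample = (b'<meta http-equiv="Content-Type" '
          b'content="text/html; charset=ISO-8859-1"><p>caf\xe9</p>')
converted = latin1_html_to_utf8(sample)
print(converted)
```

Numeric character references already present in the file are left alone, since they remain valid in UTF-8.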


On Mon, Aug 14, 2017 at 12:48 PM, Bob Stayton  wrote:

> We have a bug report suggesting that the default output encoding for the
> DocBook html stylesheet be changed from ISO-8859-1 to UTF-8.  Note this
> only applies to the original HTML 4 output from the "html" directory. The
> "xhtml" and "xhtml5" outputs already output UTF-8.
>
> The original HTML 4 standard said ISO-8859-1 was the default encoding, but
> that UTF-8 would be acceptable.  It isn't difficult for a user to change
> the output to UTF-8, but it does require a customization.  The question
> here is whether to change the default output encoding to UTF-8.
>
> This would change the HTML output to replace character references like
> 

Re: [docbook-apps] change default HTML encoding to UTF-8

2017-08-14 Thread Dave Pawson
No Bob, no change here, though I would benefit (on occasion) from utf-8

regards

On 14 August 2017 at 17:48, Bob Stayton  wrote:
> We have a bug report suggesting that the default output encoding for the
> DocBook html stylesheet be changed from ISO-8859-1 to UTF-8.  Note this only
> applies to the original HTML 4 output from the "html" directory. The "xhtml"
> and "xhtml5" outputs already output UTF-8.
>
> The original HTML 4 standard said ISO-8859-1 was the default encoding, but
> that UTF-8 would be acceptable.  It isn't difficult for a user to change the
> output to UTF-8, but it does require a customization.  The question here is
> whether to change the default output encoding to UTF-8.
>
> This would change the HTML output to replace character references like
> 

[docbook-apps] change default HTML encoding to UTF-8

2017-08-14 Thread Bob Stayton
We have a bug report suggesting that the default output encoding for the 
DocBook html stylesheet be changed from ISO-8859-1 to UTF-8.  Note this 
only applies to the original HTML 4 output from the "html" directory. 
The "xhtml" and "xhtml5" outputs already output UTF-8.


The original HTML 4 standard said ISO-8859-1 was the default encoding, 
but that UTF-8 would be acceptable.  It isn't difficult for a user to 
change the output to UTF-8, but it does require a customization.  The 
question here is whether to change the default output encoding to UTF-8.
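The practical effect of the encoding choice on serialized text can be sketched in Python. This is an analogy, not the XSLT processor's own serializer, and the sample string is made up: a character that ISO-8859-1 cannot represent has to be written as a numeric character reference, while UTF-8 can carry it directly as bytes.

```python
# "café" plus a euro sign; the euro sign (U+20AC) is not in ISO-8859-1.
text = "caf\u00e9 \u20ac"

# ISO-8859-1 output: e-acute fits in one byte, the euro sign falls back
# to a numeric character reference.
latin1_out = text.encode("iso-8859-1", errors="xmlcharrefreplace")

# UTF-8 output: both characters are encoded directly, no references needed.
utf8_out = text.encode("utf-8")

print(latin1_out)
print(utf8_out)
```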


This would change the HTML output to replace character references like