Re: [xml] Serialization of documents without encoding

2018-09-27 Thread Nick Wellnhofer

On 25/09/2018 14:36, Nick Wellnhofer wrote:
The whole situation is a mess. I'd love to change the code so that non-ASCII 
chars are always encoded as UTF-8, but I'm scared to break things.


This is the change I have in mind:

https://github.com/nwellnhof/libxml2/commit/53551ec2f6a2ef03bfcfb6d73b6fd18dc70ba15d

Nick

___
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
https://mail.gnome.org/mailman/listinfo/xml


Re: [xml] Serialization of documents without encoding

2018-09-27 Thread Nick Wellnhofer

On 27/09/2018 10:59, Roumen Petrov wrote:

Let us call this case "file" mode.



Let us call this case "stream" mode.


I'm not only talking about xmllint but the serialization API (xmlSave*, 
xmlNodeDump*) in general.


Now about the test samples above: if the content is stored in a file, xmllint
works fine with an encoding (= codeset = charset).


$ cat test-noencoding.xml
Käse


No, it doesn't work fine:

$ xmllint test-noencoding.xml

K&#xE4;se

(2) Next, the a-umlaut character is encoded as a hexadecimal character
reference. This is a minor inconsistency between "stream" and "file" mode.


As shown above, "file" mode can also produce unwanted numeric character 
references.


(3) The problem is that in "stream" mode the xmllint application ignores the
value of the --encode argument:

$ echo 'Käse' | xmllint - --encode UTF-8

K&#xE4;se


Right, there is an inconsistency in xmllint. But that's not my point.

From my point of view, (1) and (2) are minor, unimportant issues. Only (3)
could be fixed, with low priority.


Unneeded numeric character references in UTF-8 output are not a minor issue. 
If you're working with non-Latin scripts, it makes serialized XML files 
unreadable for humans and blows up the file size.
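To put a rough number on the size increase, here is a small Python sketch. It is not libxml2 code: it uses Python's built-in `xmlcharrefreplace` error handler as a stand-in for a serializer that escapes every non-ASCII character, and the sample word is my own choice. Python emits decimal references where libxml2 may emit hexadecimal ones, but for these characters both forms are the same length.

```python
def utf8_size(text: str) -> int:
    """Number of bytes when non-ASCII characters are written as raw UTF-8."""
    return len(text.encode("utf-8"))


def charref_size(text: str) -> int:
    """Number of bytes when every non-ASCII character is written as a
    numeric character reference such as &#1055; (ASCII-only output)."""
    return len(text.encode("ascii", "xmlcharrefreplace"))


sample = "Привет"  # six Cyrillic letters, chosen only for illustration
print(utf8_size(sample), charref_size(sample))  # prints: 12 42
```

For this sample the escaped form is 3.5 times larger than plain UTF-8, and it is no longer readable without decoding the references.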


Nick



Re: [xml] Serialization of documents without encoding

2018-09-27 Thread Roumen Petrov

Hi Nick,

Hi,

Nick Wellnhofer wrote:
libxml2 serializes documents without an encoding declaration 
differently than documents with an explicit UTF-8 encoding:


$ echo 'Käse' |xmllint -

K&#xE4;se

$ echo 'Käse' |xmllint -

Käse

Since the encoding should default to UTF-8, can anyone explain why 
this decision was made?


I'm not sure that the XML content alone is enough to make this decision.

If a file starts with a 16-bit BOM, the processor should use that encoding
and ignore the encoding specified in the prolog.
As for an 8-bit (UTF-8) BOM: this is a program error, but a user-friendly
application may accept it, treat the XML as UTF-8, and ignore the encoding
from the prolog.
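The BOM rule described above could be sketched roughly as follows. This is only an illustration in Python, not what libxml2 does internally; the function name `sniff_bom` is hypothetical, and UTF-32 BOMs are deliberately left out (a UTF-32 LE BOM would be misreported as UTF-16 LE here).

```python
import codecs


def sniff_bom(data: bytes):
    """Return (encoding, bom_length) for a leading BOM, or (None, 0).

    Assumption for this sketch: only UTF-8 and UTF-16 BOMs are considered.
    """
    if data.startswith(codecs.BOM_UTF8):
        # "8-bit BOM": accept it and treat the document as UTF-8.
        return "utf-8", len(codecs.BOM_UTF8)
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be", len(codecs.BOM_UTF16_BE)
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le", len(codecs.BOM_UTF16_LE)
    # No BOM: the caller has to fall back to the prolog or external info.
    return None, 0
```

A caller would strip `bom_length` bytes and decode the rest with the returned encoding, consulting the prolog only when no BOM was found.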

Let us call this case "file" mode.

The next case is an externally specified encoding, for instance in HTTP,
where the header may contain the line "Content-Type: text/xml;
charset=utf-8" (see RFC 3023).

If the charset is omitted, the XML processor must use "us-ascii" as the
default. Note that in both cases the encoding specified in the XML prolog is
ignored. This is per RFC 3023, "XML Media Types" ;).
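As a rough sketch of that rule for text/xml, the following Python snippet reads the charset parameter from a Content-Type value and falls back to "us-ascii" when it is missing. The function name is hypothetical, and the snippet only covers the text/xml case discussed here; it uses the standard library's header parser rather than anything from libxml2.

```python
from email.message import Message


def charset_for_text_xml(content_type: str) -> str:
    """Return the charset given in a Content-Type value, or "us-ascii"
    if the parameter is omitted (the text/xml default described above)."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lower-cased charset or None.
    return msg.get_content_charset() or "us-ascii"


print(charset_for_text_xml("text/xml; charset=utf-8"))  # prints: utf-8
print(charset_for_text_xml("text/xml"))                 # prints: us-ascii
```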


Let us call this case "stream" mode.

The above also means that the application is responsible for setting the
encoding before the XML library processes the document.


Now about the test samples above: if the content is stored in a file, xmllint
works fine with an encoding (= codeset = charset).

$ cat test-noencoding.xml
Käse

$ xmllint test-noencoding.xml --encode ISO8859-1 | iconv -f ISO8859-1

Käse

$ xmllint test-noencoding.xml --encode ISO8859-5

K&#228;se

$ xmllint test-noencoding.xml --encode us-ascii

K&#228;se

Remark: decimal 228 is equal to hexadecimal xE4.
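The same effect can be reproduced outside xmllint. The Python sketch below is only an analogy: it uses Python's codecs and the `xmlcharrefreplace` error handler, not libxml2, to show that a character the target encoding cannot represent ends up as a numeric character reference, while ISO-8859-1 can store a-umlaut directly as the single byte E4.

```python
text = "Käse"

# ISO-8859-1 contains a-umlaut, so it is written as one raw byte (0xE4).
latin1 = text.encode("iso8859_1")

# ISO-8859-5 (Cyrillic) and US-ASCII do not contain a-umlaut, so the
# character is replaced by a decimal numeric character reference.
cyrillic = text.encode("iso8859_5", "xmlcharrefreplace")
ascii_only = text.encode("ascii", "xmlcharrefreplace")

print(latin1)      # prints: b'K\xe4se'
print(cyrillic)    # prints: b'K&#228;se'
print(ascii_only)  # prints: b'K&#228;se'

# Decimal 228 and hexadecimal E4 are the same code point.
assert ord("ä") == 228 == 0xE4
```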


Now about your "stream" example: echo 'Käse' | xmllint -

(1) First, it is visible that the XML prolog in the output lacks an encoding.
Perhaps it would be good for xmllint to produce such information.
For instance, in RFC 3023 the charset is optional, but the document states
that use of the charset parameter is "STRONGLY RECOMMENDED".

(2) Next, the a-umlaut character is encoded as a hexadecimal character
reference. This is a minor inconsistency between "stream" and "file" mode.


(3) The problem is that in "stream" mode the xmllint application ignores the
value of the --encode argument:

$ echo 'Käse' | xmllint - --encode UTF-8

K&#xE4;se

From my point of view, (1) and (2) are minor, unimportant issues. Only (3)
could be fixed, with low priority.



The report looks like an issue in the application code, not in the library.



Nick


Regards,
Roumen Petrov
