Re: [xml] Serialization of documents without encoding
On Thu, Sep 27, 2018 at 02:22:55PM +0200, Nick Wellnhofer wrote:
> On 27/09/2018 10:59, Roumen Petrov wrote:
> > Let us call this case "file" mode.
>
> > Let us call this case "stream" mode.
>
> I'm not only talking about xmllint but the serialization API (xmlSave*,
> xmlNodeDump*) in general.
>
> > Now about the test samples above. If the content is stored in a file,
> > xmllint works fine with encoding (=codeset=charset).
> >
> > $ cat test-noencoding.xml
> > Käse
>
> No, it doesn't work fine:
>
> $ xmllint test-noencoding.xml
> K&#xE4;se
>
> > (2) Next, the a-umlaut character is encoded in hexadecimal. Minor
> > inconsistency between "stream" and "file" mode.
>
> As shown above, "file" mode can also produce unwanted numeric character
> references.
>
> > (3) The problem is that in "stream" mode the xmllint application ignores
> > the value of the encode argument:
> > $ echo 'Käse' | xmllint - --encode UTF-8
> >
> > K&#xE4;se
>
> Right, there is an inconsistency in xmllint. But that's not my point.
>
> > From my point of view (1) and (2) are minor non-important issues. Only
> > (3) could be fixed with low priority.
>
> Unneeded numeric character references in UTF-8 output are not a minor issue.
> If you're working with non-Latin scripts, it makes serialized XML files
> unreadable for humans and blows up the file size.

Not breaking a decade of programs that may be expecting that behaviour
sounds far more important to me, honestly.

Daniel

--
Daniel Veillard      | Red Hat Developers Tools http://developer.redhat.com/
veill...@redhat.com  | libxml Gnome XML XSLT toolkit http://xmlsoft.org/
http://veillard.com/ | virtualization library http://libvirt.org/
___
xml mailing list, project page http://xmlsoft.org/
xml@gnome.org
https://mail.gnome.org/mailman/listinfo/xml
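The file-size point in the quoted text can be put into numbers. The following is a minimal sketch, not libxml2 code: `escape_non_ascii` and `serialized_sizes` are hypothetical helpers written for this illustration, and the sample strings are arbitrary. It compares the byte count of raw UTF-8 text with the same text written as hexadecimal character references.

```python
def escape_non_ascii(text):
    """Replace every non-ASCII character with a hexadecimal character reference."""
    return "".join(ch if ord(ch) < 128 else f"&#x{ord(ch):X};" for ch in text)


def serialized_sizes(text):
    """Return (bytes as raw UTF-8, bytes as ASCII with character references)."""
    raw = len(text.encode("utf-8"))
    escaped = len(escape_non_ascii(text).encode("ascii"))
    return raw, escaped


# Arbitrary samples: Latin with one umlaut, Cyrillic, CJK.
for sample in ("Käse", "Привет", "日本語"):
    raw, escaped = serialized_sizes(sample)
    print(f"UTF-8: {raw:2d} bytes  escaped: {escaped:2d} bytes  {escape_non_ascii(sample)}")
```

For text that is mostly ASCII the overhead is small, but for Cyrillic or CJK text every character costs seven or eight bytes instead of two or three, and the output is no longer readable without decoding the references.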
Re: [xml] Serialization of documents without encoding
On Tue, Sep 25, 2018 at 01:19:51PM +0200, Nick Wellnhofer wrote:
> libxml2 serializes documents without an encoding declaration differently
> than documents with an explicit UTF-8 encoding:
>
> $ echo 'Käse' |xmllint -
>
> K&#xE4;se
>
> $ echo 'Käse' |xmllint -
>
> Käse
>
> Since the encoding should default to UTF-8, can anyone explain why this
> decision was made?

Because using the codepoint is part of the core XML spec, so there is no way
this can be screwed up when people do manipulations like cutting part of an
XML document and pasting it somewhere else where the context may be
different. So if you don't explicitly ask for an encoding, libxml2 will
deliver the most resilient serialization possible, and that means using
codepoints, except where that is not possible (and then there are specifics
about attribute serialization, etc.).

Please keep it that way. You have no idea what people may have done, and
unless this really fixes an issue I would be very reluctant to change this
behaviour.

thanks,

Daniel
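The resilience argument above can be checked with a small experiment. This is a sketch under stated assumptions, using Python's standard library rather than libxml2: the `survives` helper and the `doc` wrapper element are hypothetical, and Python's `xmlcharrefreplace` error handler emits decimal references where xmllint's default output uses hexadecimal ones. It serializes the same text once as ASCII with character references and once as raw UTF-8, then reads both back in contexts that assume different encodings.

```python
import xml.etree.ElementTree as ET

TEXT = "Käse"

# Serialization A: ASCII only, non-ASCII characters become character references.
escaped = TEXT.encode("ascii", "xmlcharrefreplace")
# Serialization B: raw UTF-8 bytes, as produced when UTF-8 output is requested.
raw = TEXT.encode("utf-8")


def survives(payload, assumed_encoding):
    """True if the payload still yields TEXT when the receiving context
    decodes it with assumed_encoding ("doc" is a hypothetical wrapper)."""
    decoded = payload.decode(assumed_encoding, errors="replace")
    return ET.fromstring(f"<doc>{decoded}</doc>").text == TEXT


for enc in ("utf-8", "iso-8859-1", "iso-8859-5", "us-ascii"):
    print(f"{enc:<11} escaped ok: {survives(escaped, enc)}   raw UTF-8 ok: {survives(raw, enc)}")
```

The escaped form comes back intact under every assumed encoding, while the raw UTF-8 bytes only survive when the receiving context also assumes UTF-8. That is the trade-off being debated in this thread: robustness against a changed context versus readability and size.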
Re: [xml] Serialization of documents without encoding
Sorry, I didn't watch my xml folder for a while ... a bit busy.

On Sat, Oct 06, 2018 at 07:32:00PM +0300, Roumen Petrov wrote:
> Hi Nick,
>
> Nick Wellnhofer wrote:
> > On 25/09/2018 14:36, Nick Wellnhofer wrote:
> > > The whole situation is a mess. I'd love to change the code so that
> > > non-ASCII chars are always encoded as UTF-8, but I'm scared to break
> > > things.
>
> A long time ago I did some tests with HTML -
> http://roumenpetrov.info/tests/charset/ .
>
> The case is quite similar - the encoding could be defined externally in the
> HTTP header

Except it usually doesn't work, so tons of workarounds need to be applied.

> Content-Type: text/html; charset=ISO8859-5
> ...
> and at the same time in the HTML header (internal)
> ...
> If I remember well (10-15 years ago) Internet Explorer preferred the
> internal encoding while other browsers preferred the external one.

Yup, it was a mess. I heard horror stories from various parties implementing
even XML support in browsers.

> I created a similar test to check what the situation is with XML
> http://roumenpetrov.info/tests/charset/index-xml.html and did some tests
> (browsers - Firefox, Opera, Chromium, Konqueror).
>
> The tests show that all(1) browsers could read XML in the following case:
> - HTTP header without charset, i.e. Content-Type: text/html;
> - XML prolog with encoding, i.e.
>
> Without an encoding in the prolog only a file in the UTF-8 codeset could be
> read (no surprise).
>
> The behavior of some browsers depends on the file suffix. This is the reason
> the tests use the .xml and .none suffixes.
>
> A mix between charset and encoding fails as expected, except in the case
> charset=iso8859-1 where some browsers show the content properly.
>
> Based on the tests I think that if we switch to UTF-8 encoded content by
> default, it is good to have the encoding in the prolog. It is less risky.
>
> > This is the change I have in mind:
> >
> > https://github.com/nwellnhof/libxml2/commit/53551ec2f6a2ef03bfcfb6d73b6fd18dc70ba15d
>
> Ok to remove "Special escaping routines", but the patch shows that in the
> regression tests the prolog remains as "".
> I'm not sure that such a code modification is safe.

That kind of thing can backfire *very* easily. What is the problem we are
trying to solve? Some people are likely to expect the behaviour of falling
back to codepoints for characters outside of the ASCII range when no encoding
is specified.

Daniel
Re: [xml] Serialization of documents without encoding
Hi Nick,

Nick Wellnhofer wrote:
> On 25/09/2018 14:36, Nick Wellnhofer wrote:
> > The whole situation is a mess. I'd love to change the code so that
> > non-ASCII chars are always encoded as UTF-8, but I'm scared to break
> > things.

A long time ago I did some tests with HTML -
http://roumenpetrov.info/tests/charset/ .

The case is quite similar - the encoding could be defined externally in the
HTTP header
...
Content-Type: text/html; charset=ISO8859-5
...
and at the same time in the HTML header (internal)
...
...
If I remember well (10-15 years ago) Internet Explorer preferred the internal
encoding while other browsers preferred the external one.

I created a similar test to check what the situation is with XML
http://roumenpetrov.info/tests/charset/index-xml.html and did some tests
(browsers - Firefox, Opera, Chromium, Konqueror).

The tests show that all(1) browsers could read XML in the following case:
- HTTP header without charset, i.e. Content-Type: text/html;
- XML prolog with encoding, i.e.

Without an encoding in the prolog only a file in the UTF-8 codeset could be
read (no surprise).

The behavior of some browsers depends on the file suffix. This is the reason
the tests use the .xml and .none suffixes.

A mix between charset and encoding fails as expected, except in the case
charset=iso8859-1 where some browsers show the content properly.

Based on the tests I think that if we switch to UTF-8 encoded content by
default, it is good to have the encoding in the prolog. It is less risky.

> This is the change I have in mind:
>
> https://github.com/nwellnhof/libxml2/commit/53551ec2f6a2ef03bfcfb6d73b6fd18dc70ba15d

Ok to remove "Special escaping routines", but the patch shows that in the
regression tests the prolog remains as "".
I'm not sure that such a code modification is safe.

> Nick

Regards,
Roumen
Re: [xml] Serialization of documents without encoding
On 25/09/2018 14:36, Nick Wellnhofer wrote:
> The whole situation is a mess. I'd love to change the code so that
> non-ASCII chars are always encoded as UTF-8, but I'm scared to break
> things.

This is the change I have in mind:

https://github.com/nwellnhof/libxml2/commit/53551ec2f6a2ef03bfcfb6d73b6fd18dc70ba15d

Nick
Re: [xml] Serialization of documents without encoding
On 27/09/2018 10:59, Roumen Petrov wrote:
> Let us call this case "file" mode.

> Let us call this case "stream" mode.

I'm not only talking about xmllint but the serialization API (xmlSave*,
xmlNodeDump*) in general.

> Now about the test samples above. If the content is stored in a file,
> xmllint works fine with encoding (=codeset=charset).
>
> $ cat test-noencoding.xml
> Käse

No, it doesn't work fine:

$ xmllint test-noencoding.xml
K&#xE4;se

> (2) Next, the a-umlaut character is encoded in hexadecimal. Minor
> inconsistency between "stream" and "file" mode.

As shown above, "file" mode can also produce unwanted numeric character
references.

> (3) The problem is that in "stream" mode the xmllint application ignores
> the value of the encode argument:
> $ echo 'Käse' | xmllint - --encode UTF-8
>
> K&#xE4;se

Right, there is an inconsistency in xmllint. But that's not my point.

> From my point of view (1) and (2) are minor non-important issues. Only
> (3) could be fixed with low priority.

Unneeded numeric character references in UTF-8 output are not a minor issue.
If you're working with non-Latin scripts, it makes serialized XML files
unreadable for humans and blows up the file size.

Nick
Re: [xml] Serialization of documents without encoding
Hi Nick,

Nick Wellnhofer wrote:
> Hi,
>
> libxml2 serializes documents without an encoding declaration differently
> than documents with an explicit UTF-8 encoding:
>
> $ echo 'Käse' |xmllint -
>
> K&#xE4;se
>
> $ echo 'Käse' |xmllint -
>
> Käse
>
> Since the encoding should default to UTF-8, can anyone explain why this
> decision was made?

I'm not sure that the XML content alone is enough to make the decision.

If a file starts with a 16-bit BOM, the processor should use this encoding and
should ignore the encoding specified in the prolog. About an 8-bit BOM - this
is a program error, but a user-friendly application may accept it, treat the
XML as UTF-8 and ignore the encoding from the prolog.
Let us call this case "file" mode.

The next case is an externally specified encoding. For instance in the HTTP
protocol - for example if the header has the line "Content-Type: text/xml;
charset=utf-8" (see rfc3023). If the charset is omitted, the XML processor
must use "us-ascii" as the default. Note that in both cases the encoding
specified in the XML prolog is ignored. This is per rfc3023 "XML Media
Types" ;).
Let us call this case "stream" mode.

The above also means that the application is responsible for setting the
encoding before the XML library processes the document.

Now about the test samples above. If the content is stored in a file, xmllint
works fine with encoding (=codeset=charset).

$ cat test-noencoding.xml
Käse
$ xmllint test-noencoding.xml --encode ISO8859-1 | iconv -f ISO8859-1
Käse
$ xmllint test-noencoding.xml --encode ISO8859-5
K&#228;se
$ xmllint test-noencoding.xml --encode us-ascii
K&#228;se

Remark: decimal 228 is equal to hexadecimal xE4.

Now about your "stream" example: echo 'Käse' | xmllint -

(1) First, it is visible that the XML prolog in the output lacks an encoding.
Perhaps it would be good for xmllint to produce such information. For
instance, in rfc3023 the charset is optional, but the document "STRONGLY
RECOMMEND" use of the charset parameter.

(2) Next, the a-umlaut character is encoded in hexadecimal. Minor
inconsistency between "stream" and "file" mode.

(3) The problem is that in "stream" mode the xmllint application ignores the
value of the encode argument:
$ echo 'Käse' | xmllint - --encode UTF-8

K&#xE4;se

From my point of view (1) and (2) are minor non-important issues. Only (3)
could be fixed with low priority.

The report looks like an issue in application code, not in the library.

> Nick

Regards,
Roumen Petrov
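The "file" and "stream" cases described in this message amount to a precedence rule for choosing the input encoding. The following is a minimal sketch of that rule as the message states it, not of what libxml2 or xmllint actually do: `pick_encoding` is a hypothetical function, and checking the BOM before the external charset is an assumption, since the message treats the two cases separately.

```python
import codecs


def pick_encoding(data, media_type=None, charset=None):
    """Return the encoding an application would hand to the XML library,
    or None if the prolog (or the UTF-8 default) should decide."""
    # "file" mode: a 16-bit BOM wins over the encoding in the prolog.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # Lenient handling of an 8-bit BOM: treat the document as UTF-8.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    # "stream" mode: an external charset parameter overrides the prolog.
    if charset:
        return charset.lower()
    # Per the message's reading of rfc3023: text/xml without charset is us-ascii.
    if media_type == "text/xml":
        return "us-ascii"
    return None


print(pick_encoding(b"\xff\xfe<\x00d\x00"))               # utf-16
print(pick_encoding(b"<doc/>", "text/xml", "ISO8859-5"))  # iso8859-5
print(pick_encoding(b"<doc/>", "text/xml"))               # us-ascii
print(pick_encoding(b"<doc/>"))                           # None
```

A real implementation would also have to distinguish UTF-32 byte order marks, which share a prefix with the little-endian UTF-16 one, and would return the precise UTF-16 variant; both are left out here to keep the precedence visible.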
Re: [xml] Serialization of documents without encoding
On 25/09/2018 13:19, Nick Wellnhofer wrote:
> libxml2 serializes documents without an encoding declaration differently
> than documents with an explicit UTF-8 encoding:

It seems that this was partially changed in 2005 with the following commit:

https://gitlab.gnome.org/GNOME/libxml2/commit/64354ea7d6b8e0d95f3f9bcfdc98bddd065b65fc

But this change only applies to text nodes, not attribute content. It also
only applies when serializing with xmlNodeDumpOutput or xmlNodeDump, not when
using the xmlSave API (which xmllint uses).

The whole situation is a mess. I'd love to change the code so that non-ASCII
chars are always encoded as UTF-8, but I'm scared to break things.

Nick