On 20-12-2019 12:55, Andrew Nenakhov wrote:
> On Thu, 19 Dec 2019 at 19:02, Ralph Meijer <[email protected]> wrote:
>> If you want consistent counting on all platforms and languages, counting Unicode characters seems to be the best way forward.
>
> We do not dispute that 'counting unicode characters seems the best way forward'. However, we do dispute when to count them. It's more of a preference issue, but we chose to count characters in the XML doc we send, because the XML standard is common to every platform and language.
Just to be clear: an XML stream is encoded in UTF-8 and has additional processing (like entities) to represent a text. While that series of UTF-8 encoded characters does itself represent a sequence of Unicode characters (let's call it seq1), that sequence is not necessarily equivalent to the abstract sequence of characters that represents the above-mentioned text (seq2).
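As a minimal sketch of the difference, assuming Python and its standard `xml.etree.ElementTree` parser: the helper names `count_serialized` and `count_text` and the sample body are made up for illustration and are not part of any XMPP library or specification. The first helper counts the characters as they appear in the serialized XML (seq1), the second counts the characters of the text after the parser has resolved entities and CDATA sections (seq2).

```python
import xml.etree.ElementTree as ET


def count_serialized(xml_snippet: str, tag: str = "body") -> int:
    """Count characters of the element content as serialized (seq1)."""
    start = xml_snippet.index(">") + 1      # just past the opening tag
    end = xml_snippet.rindex("</" + tag)    # start of the closing tag
    return len(xml_snippet[start:end])


def count_text(xml_snippet: str) -> int:
    """Count Unicode code points of the parsed text (seq2)."""
    text = ET.fromstring(xml_snippet).text or ""
    return len(text)                        # len() on str counts code points


# Hypothetical message body: the text "a & b" with the ampersand escaped.
snippet = "<body>a &amp; b</body>"

print(count_serialized(snippet))  # 9, the escaped form "a &amp; b"
print(count_text(snippet))        # 5, the text "a & b"
```

The two counts diverge as soon as anything in the text needs escaping, even though both are counts of Unicode characters.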
Counting in seq1 and counting in seq2 are different things as soon as there are CDATA sections, entities, etc., and I consider counting seq1 to be the wrong approach. That is, I expect the character count for the text in the body element of the following equivalent XML snippets to be exactly 1 (the sequence containing the single character U+003c), and not 4, 5, 9, or 13, regardless of where you choose to count:
<body>&lt;</body>
<body>&#60;</body>
<body>&#x0003c;</body>
<body><![CDATA[<]]></body>

-- 
ralphm
_______________________________________________
Standards mailing list
Info: https://mail.jabber.org/mailman/listinfo/standards
Unsubscribe: [email protected]
_______________________________________________
