Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

Ralph Meijer Fri, 20 Dec 2019 04:16:04 -0800


On 20-12-2019 12:55, Andrew Nenakhov wrote:

Thu, 19 Dec 2019 at 19:02, Ralph Meijer <[email protected]>:
    If you want consistent counting on all platforms and languages,
    counting
    Unicode characters seems to be the best way forward.
We do not dispute that 'counting unicode characters seems the best way forward'. However, we do dispute when to count them. It's more of a preference issue, but we chose to count characters in the XML doc we send, because XML standard is common for any platform and language.

Just to be clear: an XML stream is encoded in UTF-8 and requires additional processing (such as resolving entities) to represent a text. While that series of UTF-8 encoded characters does itself represent a sequence of Unicode characters (let's call it seq1), that sequence is not necessarily equivalent to the abstract sequence of characters that represents the above-mentioned text (seq2).

Counting in seq1 and counting in seq2 are different things as soon as there are CDATA sections, entities, etc., and I consider counting seq1 to be the wrong approach. I.e., I expect the character count for the text in the body element of the following equivalent XML snippets to be exactly 1 (the sequence containing the single character U+003C), and not 4, 5, 9, or 13, regardless of where you choose to count:


  <body>&lt;</body>
  <body>&#60;</body>
  <body>&#x0003C;</body>
  <body><![CDATA[<]]></body>
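
To illustrate the difference between the two counts, here is a minimal sketch. It is not taken from the proposed extension; it assumes Python's standard xml.etree.ElementTree parser as a stand-in for whatever XML parser a client uses, and the helper names raw_length and text_length are made up for this example. Python's len() on a str counts Unicode code points, which is the notion of "character" assumed here.

```python
import xml.etree.ElementTree as ET

# The four equivalent snippets from the message above.
SNIPPETS = [
    "<body>&lt;</body>",
    "<body>&#60;</body>",
    "<body>&#x0003C;</body>",
    "<body><![CDATA[<]]></body>",
]


def raw_length(snippet):
    """Count characters of the serialized content between the tags (seq1)."""
    start = len("<body>")
    end = len(snippet) - len("</body>")
    return end - start


def text_length(snippet):
    """Count characters of the parsed text the element represents (seq2)."""
    return len(ET.fromstring(snippet).text)


raw_counts = [raw_length(s) for s in SNIPPETS]
text_counts = [text_length(s) for s in SNIPPETS]

print(raw_counts)   # [4, 5, 9, 13]
print(text_counts)  # [1, 1, 1, 1]
```

Counting after parsing gives the same result for every serialization, whereas counting the serialized form depends on how the sender happened to escape the text.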

--
ralphm