Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

Florian Schmaus Fri, 04 Dec 2020 12:54:51 -0800

On 12/4/20 9:33 PM, Sam Whited wrote:

>> And I am in favor of code points because it allows us to aim for the
>> extended grapheme cluster algorithm, while also allowing for the
>> "simply count code points" fallback.
>
> If you do bytes you could also easily convert to codepoints and then to
> grapheme clusters. It also allows for the simple "count codepoints" or
> "count bytes" fallback.

If you count the bytes of the UTF-8 encoded representation, then there is no way to have any fallback (as the indexes would be wrong).

Maybe an example can illustrate where I see the advantage of counting graphemes/code points over counting the bytes of the UTF-8 encoded representation. Consider the following text:


Über

Code points: U+00DC U+0062 U+0065 U+0072
Graphemes:   (U+00DC) (U+0062) (U+0065) (U+0072)
UTF-8 bytes: c3 9c 62 65 72

Assume we want to provide the coordinates for the span that consists of the first two letters. e.g.:


Über
^^

Then, with a zero-indexed scheme where start is inclusive and end is exclusive, you end up with


start=0
end=3

if you count bytes.

But you end up with

start=0
end=2

regardless of whether you count code points or graphemes.
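
The offsets above can be reproduced with a few lines of Python. This is only a minimal sketch: the helper name code_point_span_to_byte_span is made up for illustration and is not part of any proposal, and the text is assumed to use the precomposed U+00DC.

```python
def code_point_span_to_byte_span(text, start, end):
    """Convert a [start, end) span of code points to UTF-8 byte offsets.

    Hypothetical helper for illustration; start is inclusive, end exclusive.
    """
    byte_start = len(text[:start].encode("utf-8"))
    byte_end = len(text[:end].encode("utf-8"))
    return byte_start, byte_end


text = "\u00dcber"  # "Über" with the precomposed U+00DC

# Python strings are indexed by code point, so len() counts code points.
code_points = [f"U+{ord(c):04X}" for c in text]
utf8_hex = text.encode("utf-8").hex(" ")

# The first two letters are code points [0, 2) ...
span_in_bytes = code_point_span_to_byte_span(text, 0, 2)

print(code_points)    # ['U+00DC', 'U+0062', 'U+0065', 'U+0072']
print(utf8_hex)       # c3 9c 62 65 72
print(span_in_bytes)  # (0, 3): the same span needs end=3 when counting bytes
```

Note that the conversion has to walk the text either way; a byte offset that lands in the middle of a multi-byte sequence has no code point equivalent at all.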

This is, of course, because in the example the number of code points and graphemes is identical. But this allows developers to easily bootstrap this scheme by simply counting code points at first. I wouldn't be surprised if that worked so well that they never even switch to grapheme counting.
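
For completeness, here is a rough sketch of a case where the two counts diverge: the same word written with a combining diaeresis (U+0055 U+0308) instead of the precomposed U+00DC. The cluster detection below is a deliberately crude stand-in, assuming that a cluster is a base code point plus any following combining marks. It is not the extended grapheme cluster algorithm from UAX #29, and the function names are made up for illustration.

```python
import unicodedata


def approx_cluster_starts(text):
    """Code point indices where a new cluster starts.

    Crude approximation: every code point with canonical combining
    class 0 starts a cluster. NOT the UAX #29 algorithm.
    """
    return [i for i, c in enumerate(text) if unicodedata.combining(c) == 0]


def cluster_span_to_code_points(text, start, end):
    """Map a [start, end) span of approximate clusters to code point offsets."""
    bounds = approx_cluster_starts(text) + [len(text)]
    return bounds[start], bounds[end]


composed = "\u00dcber"   # U+00DC b e r
decomposed = "U\u0308ber"  # U+0055 U+0308 b e r

# Precomposed: code points and clusters agree, so the fallback is exact.
print(len(composed), len(approx_cluster_starts(composed)))      # 4 4
# Decomposed: one more code point than clusters.
print(len(decomposed), len(approx_cluster_starts(decomposed)))  # 5 4

# "First two letters" as a cluster span [0, 2):
print(cluster_span_to_code_points(composed, 0, 2))    # (0, 2)
print(cluster_span_to_code_points(decomposed, 0, 2))  # (0, 3)
```

With precomposed text the code point fallback gives exactly the grapheme offsets; it only drifts once combining sequences show up.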


- Florian
