On 12/4/20 3:27 PM, Sam Whited wrote:
FWIW I was a big proponent of doing it this way too, but I've changed my
mind after seeing too many grapheme segmentation implementations be
broken in small, different, ways. My new position is that we have to
just count bytes and figure out a sane behavior in case someone sends us
an invalid offset in the middle of a codepoint or something. This is
encoding agnostic (not that it matters for XMPP) and makes it very easy
to count: go to that byte offset, check if we're on any sort of UTF-8
boundary, if so call it a day, if not do whatever the fallback is.
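
For illustration, the byte-offset check described in the quoted text could be sketched in Python roughly as follows. The function names are hypothetical, the input is assumed to be valid UTF-8, and the fallback (snapping back to the previous boundary) is only one possible choice, since the quoted message leaves it open.

```python
def is_utf8_boundary(data: bytes, offset: int) -> bool:
    """Return True if offset falls between two code points of valid UTF-8 data."""
    if offset < 0 or offset > len(data):
        return False
    if offset == len(data):
        return True  # the end of the buffer is always a boundary
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    return (data[offset] & 0xC0) != 0x80


def snap_to_boundary(data: bytes, offset: int) -> int:
    """Hypothetical fallback: move an invalid offset back to the previous boundary."""
    offset = max(0, min(offset, len(data)))
    while 0 < offset < len(data) and (data[offset] & 0xC0) == 0x80:
        offset -= 1
    return offset


body = "aé😀".encode("utf-8")  # 1 + 2 + 4 bytes
print([i for i in range(len(body) + 1) if is_utf8_boundary(body, i)])
```

Note that this only guarantees a code point boundary; an offset can still land in the middle of a grapheme cluster.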

This also reads like it is mixing multiple independent layers: the bytes on the wire versus the data you receive in the higher layers. For example, your XMPP API may provide a method Message.getBody() that returns a String. That String will be held in your programming language's native string representation, which may or may not match the bytes on the wire.
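
A small Python sketch of that layering problem: the same message body has a different length depending on which representation you measure. The helper name length_views is made up for this example, and the UTF-16 figure merely stands in for languages whose native strings use UTF-16 code units.

```python
def length_views(text: str) -> dict:
    """Measure one string in three common representations."""
    return {
        # what an XMPP stream carries on the wire
        "utf8_bytes": len(text.encode("utf-8")),
        # what a UTF-16 based native string would report as its length
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        # what Python's len() reports for a str
        "code_points": len(text),
    }


print(length_views("a😀"))
```

An offset computed in one of these units is only meaningful to a peer that counts in the same unit.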

As I do not know of any alternative, grapheme cluster counting is the only sound way to achieve interoperability without excluding our friends from all over the world and their characters, which is important to me.

However, I have a counterproposal that goes in a similar direction to yours: even if the specification asks for grapheme clusters, there is nothing wrong with falling back to character counting if you haven't implemented grapheme cluster counting (yet). I would expect that it will just work most of the time (for users of the Arabic alphabet).
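
To give a feel for when such a fallback agrees with cluster counting, here is a rough Python sketch. Both function names are hypothetical, and approx_cluster_count is deliberately NOT a real grapheme cluster segmenter (it ignores emoji ZWJ sequences, regional indicators, Hangul jamo and more); it only folds combining marks into the preceding character, which is enough to show where plain character counting diverges.

```python
import unicodedata


def count_code_points(text: str) -> int:
    """The fallback: count characters (code points) only."""
    return len(text)


def approx_cluster_count(text: str) -> int:
    """Crude approximation: a combining mark (category M*) joins the previous character."""
    count = 0
    for index, ch in enumerate(text):
        is_mark = unicodedata.category(ch).startswith("M")
        if index == 0 or not is_mark:
            count += 1
    return count


for sample in ("hello", "\u00e9", "e\u0301"):
    print(repr(sample), count_code_points(sample), approx_cluster_count(sample))
```

For precomposed text the two counts agree; as soon as a combining mark is involved they drift apart, which is where the opportunistic approach stops working.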

While this in no way allows for sound interoperability, it is a sort of opportunistic interoperability.

- Florian


_______________________________________________
Standards mailing list
Info: https://mail.jabber.org/mailman/listinfo/standards
_______________________________________________
