I presume that the majority of implementations will do the UTF-8 decoding 
before/during XML parsing, so with offsets specified as bytes they will likely 
have to awkwardly re-encode the string just to cross-reference those byte 
offsets with the codepoint* offsets they actually need.

For those which must operate on the byte level, anything other than byte 
offsets is going to be awkward. You can still manage without fully decoding 
UTF-8, however, as all non-head (continuation) bytes have the pattern 
10xxxxxx, so counting only head bytes will lead you to the correct 
start-of-codepoint - though it's obviously a little more work than direct 
indexing.
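
A rough sketch of that head-byte counting, in Python (the function name and 
the choice of language are mine, purely for illustration):

    def byte_to_codepoint_offset(data: bytes, byte_offset: int) -> int:
        # Continuation bytes match 10xxxxxx; every other byte starts a
        # codepoint, so counting non-continuation bytes up to the offset
        # gives the codepoint index without decoding anything.
        return sum(1 for b in data[:byte_offset] if (b & 0xC0) != 0x80)

    data = "naïve".encode("utf-8")            # the 'ï' occupies two bytes
    print(byte_to_codepoint_offset(data, 4))  # -> 3, i.e. the 'v'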

Byte offsets admit every error case that codepoint offsets do, plus the 
additional possibility of an offset landing mid-codepoint - something that's 
impossible if codepoints are your units.
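
That extra error case is at least cheap to detect without decoding (again 
just a sketch of my own):

    def is_codepoint_boundary(data: bytes, byte_offset: int) -> bool:
        # An offset pointing at a continuation byte (10xxxxxx) has landed
        # mid-codepoint - an error codepoint offsets can never produce.
        if byte_offset >= len(data):
            return byte_offset == len(data)
        return (data[byte_offset] & 0xC0) != 0x80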

As for mid-glyph offsets, are they really a problem beyond possibly displaying 
badly? Where they're treated as an error, an easy solution would be to quietly 
round the start/end offsets to the start/end of their glyphs - obviously this 
is handled most efficiently by the display layer, but presumably that's the 
only place it matters anyway.
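
To make that quiet rounding concrete, here's a minimal sketch assuming the 
display layer already knows where its glyphs (grapheme clusters) start - the 
glyph_starts list is hypothetical, however it's actually obtained:

    import bisect

    def snap_to_glyphs(start, end, glyph_starts, text_len):
        # glyph_starts: sorted offsets at which each glyph begins (first
        # entry 0); text_len closes the final glyph. Assumes
        # 0 <= start <= end <= text_len.
        bounds = glyph_starts + [text_len]
        start = bounds[bisect.bisect_right(bounds, start) - 1]  # round down
        end = bounds[bisect.bisect_left(bounds, end)]           # round up
        return start, end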

From another angle, I'd position XMPP above XML, and XML above the text 
encoding scheme used (UTF-8), so it seems wrong to be concerning ourselves 
with details of the encoding scheme from the top level.



* It's probably worth mentioning that there are a number of confusions people 
have with Unicode, and saying 'character' when they mean 'codepoint' is one of 
them (the two are equivalent for the single-codepoint characters they're 
familiar with).
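
A quick illustration of that distinction (my own example):

    s = "e\u0301"    # 'e' followed by a combining acute accent
    print(len(s))    # 2 codepoints, though it renders as a single 'é'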