I presume the majority of implementations will do their UTF-8 decoding before or during XML parsing, so if offsets are specified in bytes they will likely have to awkwardly re-encode the string just to cross-reference those byte offsets with the codepoint* offsets they actually need.
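To make that cross-referencing step concrete, here's a rough sketch of the usual counting approach (my own illustration, in Python; byte_to_codepoint_offset is a name I've made up):

def byte_to_codepoint_offset(data: bytes, byte_offset: int) -> int:
    # Each UTF-8 codepoint has exactly one head byte; continuation
    # bytes all match 10xxxxxx, i.e. (b & 0xC0) == 0x80. Counting
    # the head bytes up to byte_offset gives the codepoint offset.
    return sum(1 for b in data[:byte_offset] if (b & 0xC0) != 0x80)

data = "héllo".encode("utf-8")   # b'h\xc3\xa9llo' - 'é' takes two bytes
assert byte_to_codepoint_offset(data, 3) == 2

Note that's a linear scan per lookup - exactly the sort of extra work that direct indexing into an already-decoded string avoids.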
For implementations that must operate at the byte level, anything other than byte offsets is going to be awkward. You can still manage without fully decoding the UTF-8, however: all non-head bytes have the pattern 10xxxxxx, so counting only head bytes will lead you to the correct start-of-codepoint - though it's obviously a little more work than direct indexing.

Byte offsets admit every error case that codepoint offsets do, plus the additional possibility of an offset landing mid-codepoint - something that's impossible if codepoints are your units. As for mid-glyph offsets, is that such a problem beyond possibly displaying badly? Where it's treated as an error, an easy solution would be to quietly round the start/end offsets to the start/end of their glyphs (the analogous snap-to-codepoint-boundary operation is sketched at the end of this message). Obviously this is handled most efficiently by the display layer, but presumably that's the only place it matters anyway.

From another angle, I'd position XMPP above XML, and XML above the text encoding scheme used (UTF-8), so it seems wrong to be concerning ourselves with details of the encoding scheme from the top level.

* It's probably worth mentioning that there are a number of confusions people have with Unicode, and saying 'character' when they mean 'codepoint' is one of them (they're equivalent for the single-codepoint characters most people are familiar with).
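And here is that rounding sketch for the codepoint-boundary case (again my own, in Python; snap_to_codepoint_start is an invented name). Snapping to full glyph boundaries would additionally need grapheme segmentation, which I've left out:

def snap_to_codepoint_start(data: bytes, byte_offset: int) -> int:
    # Walk backwards over continuation bytes (10xxxxxx) until we hit
    # a head byte, quietly rounding a mid-codepoint offset down to
    # the start of its codepoint. An offset at len(data) is already
    # a valid end boundary and is left alone.
    while 0 < byte_offset < len(data) and (data[byte_offset] & 0xC0) == 0x80:
        byte_offset -= 1
    return byte_offset

data = "héllo".encode("utf-8")
assert snap_to_codepoint_start(data, 2) == 1   # offset 2 lands mid-'é'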