On 12.03.2018 16:17, Jonas Wielicki wrote:
> On Monday, 12 March 2018 15:56:04 CET Sam Whited wrote:
>> On Mon, Mar 12, 2018, at 09:20, Jonas Wielicki wrote:
>> because just as
>> scalar values can be made up of multiple bytes, glyphs (or "grapheme
>> clusters") may be made up of multiple scalar values (and, as you pointed
>> out, the range could end in the middle of a grapheme cluster that uses
>> multiple scalar values).
>>
>> In my mind there are only two things that make sense here:
>>
>> - Use bytes and come up with a way to handle bad ranges that end in the
>>   middle of a UTF-8 sequence
>
> That proposal does not make sense at all. It doesn’t solve the issue of
> having a range start or end in the middle of a grapheme cluster, and it
> introduces extra complexity by requiring implementations to re-obtain a
> UTF-8 representation of the character data (or keep it around). Sounds
> like the worst of both worlds (Grapheme Clusters vs. Scalar Values). XML
> Character Data is specified in Scalar Values (they call it Characters,
> but it really is a Scalar Value minus \uFFFF and \uFFFE), so it makes
> most sense to re-use that.
>
>> - Use grapheme clusters and require that
>>   everyone implement the segmentation algorithm
>
> This will bring us all kinds of issues with different Unicode versions.
>
>> I lean towards bytes because it keeps things simple and
>
> Then let’s stay with Scalar Values, which is what XML works with, instead
> of using a lower-level representation.
I'm also leaning towards this. We could additionally specify that a
pointer to the start or the middle of a grapheme cluster is not
recommended and that, if one is found, it should be treated as a pointer
to the cluster itself.

- Florian
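To make the units discussed in this thread concrete, here is a minimal
Python sketch. It counts the same text in UTF-8 bytes and in scalar
values, and moves a scalar-value pointer that lands inside a cluster back
to the start of that cluster. The function names are hypothetical, and the
cluster detection is a deliberate simplification: it only treats
combining marks as cluster extenders and is not the full Unicode
segmentation algorithm.

```python
import unicodedata


def units(text):
    """Return (UTF-8 byte count, scalar value count) for text.

    len(text) counts code points; for a well-formed string without lone
    surrogates this equals the number of scalar values.
    """
    return len(text.encode("utf-8")), len(text)


def snap_to_cluster_start(text, index):
    """Move a scalar-value index back to the start of its cluster.

    Assumption: a cluster is a base character followed by combining
    marks (canonical combining class != 0). Real grapheme cluster
    segmentation covers many more cases and depends on the Unicode
    version in use.
    """
    while 0 < index < len(text) and unicodedata.combining(text[index]):
        index -= 1
    return index


# 'e' followed by COMBINING ACUTE ACCENT (one cluster), then 'a'.
text = "e\u0301a"
byte_count, scalar_count = units(text)
print(byte_count, scalar_count)
print(snap_to_cluster_start(text, 1))  # index 1 points into the first cluster
```

A pointer expressed in scalar values can never land inside a UTF-8
sequence, so only the cluster case needs a rule like the one above.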
_______________________________________________
Standards mailing list
Info: https://mail.jabber.org/mailman/listinfo/standards
Unsubscribe: standards-unsubscr...@xmpp.org
_______________________________________________