On 12.03.2018 16:17, Jonas Wielicki wrote:
> On Monday, 12 March 2018 15:56:04 CET Sam Whited wrote:
>> On Mon, Mar 12, 2018, at 09:20, Jonas Wielicki wrote:
>> because just as
>> scalar values can be made up of multiple bytes, glyphs (or "grapheme
>> clusters") may be made up of multiple scalar values (and, as you pointed
>> out, the range could end in the middle of a grapheme cluster that uses
>> multiple scalar values).
>>
>> In my mind there are only two things that make sense here:
>>
>> - Use bytes and come up with a way to handle bad ranges that end in the
>> middle of a UTF-8 sequence 
> 
> That proposal does not make sense at all. It doesn’t solve the issue of
> having a range start or end in the middle of a grapheme cluster, and it
> introduces extra complexity by requiring implementations to re-obtain a
> UTF-8 representation of the character data (or keep it around). Sounds
> like the worst of both worlds (Grapheme Clusters vs. Scalar Values). XML
> Character Data is specified in Scalar Values (they call it Characters,
> but it really is a Scalar Value minus \uFFFF and \uFFFE), so it makes
> most sense to re-use that.
> 
>> - Use grapheme clusters and require that
>> everyone implement the segmentation algorithm
> 
> This will bring us all kinds of issues with different unicode versions.
> 
>> I lean towards bytes because it keeps things simple and 
> 
> Then let’s stay with Scalar Values, which is what XML works with, instead of 
> using a lower-level representation.

I'm also leaning towards this.

We could also specify that a pointer to the start or the middle of a
grapheme cluster is not recommended and, if found, should be treated as
a pointer to the cluster itself.
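
To make the three units in this thread concrete, here is a rough sketch
in Python. It is only an illustration, not proposed spec text: the
function names are made up, and the "snap" helper only looks at
combining marks, which is a crude stand-in for the real grapheme
cluster segmentation rules.

```python
import unicodedata


def utf8_len(text: str) -> int:
    """Length of the text in UTF-8 bytes."""
    return len(text.encode("utf-8"))


def scalar_len(text: str) -> int:
    """Length in code points (equal to scalar values for well-formed text)."""
    return len(text)


def snap_to_cluster_start(text: str, index: int) -> int:
    """Move an index left while it points at a combining mark.

    Hypothetical helper: this approximates "treat a pointer into a
    cluster as a pointer to the cluster itself". It does NOT implement
    full grapheme cluster segmentation (no ZWJ, Hangul, or flag handling).
    """
    while 0 < index < len(text) and unicodedata.combining(text[index]) != 0:
        index -= 1
    return index


# "e" followed by COMBINING ACUTE ACCENT (one visible glyph), then "a"
text = "e\u0301a"
```

With this input, a byte offset and a scalar-value offset already
disagree (4 bytes versus 3 scalar values), and index 1 points into the
middle of the first cluster, so the helper moves it back to index 0.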

- Florian

_______________________________________________
Standards mailing list
Info: https://mail.jabber.org/mailman/listinfo/standards
Unsubscribe: standards-unsubscr...@xmpp.org
_______________________________________________
