On Thu, 10 May 2018 11:43:12 +0200
Dorota Czaplejewicz <dorota.czaplejew...@puri.sm> wrote:

> On Tue, 08 May 2018 07:07:24 +0000
> Silvan Jegen <s.je...@gmail.com> wrote:
> 
> > On Mon, May 7, 2018 at 5:11 AM Joshua Watt <jpewhac...@gmail.com> wrote:  
> > > IMHO, if you are doing UTF-8 (which you should), you should *always*
> > > specify any offset in the string as a byte offset. I have a few
> > > reasons for this:
> > 
> > I agree with this as well. I thought some more about how to spell out my
> > gut feeling on this matter in more technical terms.
> > 
> > UTF-8 is a byte (sequence) representation of Unicode code points. This
> > indicates to me that an offset within a UTF-8-encoded string should also
> > be given in bytes. Specifying the offset in Unicode code points mixes the
> > abstraction of the Unicode code point with (one of) its representations as
> > a byte sequence. This is reflected in the fact that an offset in Unicode
> > code points is not applicable to the UTF-8 string without first processing
> > the string.
> > 
> > Unicode code points do not give us that much either since what we most
> > likely want are grapheme clusters anyway (which, like any more advanced
> > Unicode processing, should be handled by a specialised library):
> > http://utf8everywhere.org/#myth.strlen
> > 
> > 
> > Cheers,
> > 
> > Silvan  
> 
> This message made me feel obliged to turn my own gut feeling into words. This 
> is not to be construed as an argument, but more of an explanation.
> 
> I view wayland protocols as rather high level: their responsibility is to 
> specify the type and the purpose of the data they are transporting. In this 
> case, the data is a Unicode string, and the purpose is display. Or, the data 
> is a number and the purpose is indexing.
> 
> I think that when a protocol starts to specify the byte-level
> representation of its data, it can no longer be thought of as high level.
> In this view, indexing a Unicode string in terms of bytes would be akin to
> indexing any other vector of Foo in bytes. (I didn't actually check whether
> wayland has any other vector or bytes type.)
> 
> As you noted, there is some mixing between abstraction levels in the 
> protocol. Hardcoding that it's not *just* Unicode, but also the particular 
> encoding (UTF-8) eliminates problems with byte indexing we would have 
> encountered if we decided to use things like Punycode (München => 
> Mnchen-3ya). Knowing that it's always UTF-8 allows the protocol to use a
> tailored indexing scheme. Although I consider this a layer-breaking hack,
> this property partially counters the above reasoning.
> 
> * * *
> 
> To be honest, neither Unicode code points nor graphemes nor clusters are what
> we're truly looking for here. To understand what I mean, I recommend playing
> with this word:
> 
> नमस्ते
> 
> According to the Rust book [0], it's composed of 6 code points: ['न', 'म', 
> 'स', '्', 'त', 'े'], but moving the cursor around, I would be led to believe 
> it's only 4 "pieces" long.
> 
> Cheers,
> Dorota
> 
> [0] https://doc.rust-lang.org/book/second-edition/ch08-02-strings.html
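
To make the offset discussion above a bit more concrete, here is a minimal
Python sketch. The helper names are made up for illustration and are not
part of any Wayland protocol or library. It shows that a code point offset
has to be converted by walking the string before it can be applied to the
UTF-8 bytes, and that the same byte offset points at a different character
once the encoding changes (UTF-8 vs. Punycode for "München").

```python
def code_point_to_byte_offset(text, cp_offset):
    """Translate an offset counted in code points into a UTF-8 byte offset.

    The whole prefix has to be encoded, so the cost grows with the offset.
    """
    return len(text[:cp_offset].encode("utf-8"))


def byte_to_code_point_offset(data, byte_offset):
    """Translate a UTF-8 byte offset back into a code point offset.

    Raises UnicodeDecodeError if byte_offset falls inside a multi-byte
    sequence, i.e. if it is not a valid character boundary.
    """
    return len(data[:byte_offset].decode("utf-8"))


word = "München"
utf8 = word.encode("utf-8")      # 8 bytes: 'ü' takes two of them
puny = word.encode("punycode")   # b'Mnchen-3ya'

# 'n' is at code point offset 2, but at byte offset 3 in UTF-8.
print(code_point_to_byte_offset(word, 2))   # 3
print(byte_to_code_point_offset(utf8, 3))   # 2

# The same byte offset lands on different characters per encoding.
print(utf8[3:4])   # b'n'
print(puny[3:4])   # b'h'
```

Byte offset 2 in the UTF-8 data falls in the middle of 'ü', so
byte_to_code_point_offset(utf8, 2) raises instead of returning a value; a
receiver of byte offsets would presumably need a similar boundary check.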

On second thought, perhaps graphemes are actually the relevant thing here...
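
For anyone who wants to poke at the Hindi word quoted above, here is a rough
Python sketch that counts it three ways. The grouping function is a crude
approximation of my own (attach every combining mark to the preceding
character); it is not the Unicode segmentation algorithm from UAX #29, and
a specialised library should be used for real text handling. It only
happens to reproduce the 4 "pieces" for this particular word.

```python
import unicodedata


def approximate_clusters(text):
    """Group each combining mark (category M*) with the preceding character.

    This is only an illustration, NOT a UAX #29 implementation.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# नमस्ते written as explicit code points to avoid copy/paste damage.
word = "\u0928\u092e\u0938\u094d\u0924\u0947"

print(len(word.encode("utf-8")))        # 18 bytes
print(len(word))                        # 6 code points
print(len(approximate_clusters(word)))  # 4 pieces
```

So a cursor position after the third "piece" corresponds to code point
offset 4 and UTF-8 byte offset 12, which is why the choice of unit matters
for the protocol.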

_______________________________________________
wayland-devel mailing list
wayland-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/wayland-devel
