On Thu, 10 May 2018 11:43:12 +0200 Dorota Czaplejewicz <dorota.czaplejew...@puri.sm> wrote:
> On Tue, 08 May 2018 07:07:24 +0000
> Silvan Jegen <s.je...@gmail.com> wrote:
>
> > On Mon, May 7, 2018 at 5:11 AM Joshua Watt <jpewhac...@gmail.com> wrote:
> > > IMHO, if you are doing UTF-8 (which you should), you should *always*
> > > specify any offset in the string as a byte offset. I have a few
> > > reasons for this justification:
> >
> > I agree with this as well. I thought some more about how to spell out my
> > gut feeling on this matter in more technical terms.
> >
> > UTF-8 is a byte (sequence) representation of Unicode code points. This
> > indicates to me that an offset within an UTF-8-encoded string should also
> > be given in bytes. Specifying the offset in Unicode points mixes the
> > abstraction of the Unicode code point with (one of) its representations as
> > a byte sequence. This is reflected in the fact that an offset in Unicode
> > code points is not applicable to the UTF-8 string without first processing
> > the string.
> >
> > Unicode code points do not give us that much either since what we most
> > likely want are grapheme clusters anyway (which, like any more advanced
> > Unicode processing, should be handled by a specialised library):
> > http://utf8everywhere.org/#myth.strlen
> >
> >
> > Cheers,
> >
> > Silvan
>
> This message made me feel obliged to turn my own gut feeling into words. This
> is not to be construed as an argument, but more of an explanation.
>
> I view wayland protocols as rather high level: their responsibility is to
> specify the type and the purpose of the data they are transporting. In this
> case, the data is a Unicode string, and the purpose is display. Or, the data
> is a number and the purpose is indexing.
>
> I think that when a protocol starts to specify the type and purpose, it can
> no longer be thought as high level. In this view, indexing a Unicode string
> in terms of bytes would be akin to indexing any other vector of Foo in bytes.
> (I didn't actually check if there is any other vector, or bytes type
> available in wayland).
>
> As you noted, there is some mixing between abstraction levels in the
> protocol. Hardcoding that it's not *just* Unicode, but also the particular
> encoding (UTF-8) eliminates problems with byte indexing we would have
> encountered if we decided to use things like Punycode (München =>
> Mnchen-3ya). Knowing that it's always UTF-8 allows the protocol to use a
> tailoring indexing scheme. While I consider this a layer-breaking hack,
> nevertheless, this property partially counters the above reasoning.
>
> * * *
>
> To be honest, neither Unicode code points nor graphemes nor clusters are what
> we're truly looking for here. To understand what I mean, I recommend to play
> with this grapheme cluster:
>
> नमस्ते
>
> According to the Rust book [0], it's composed of 6 code points: ['न', 'म',
> 'स', '्', 'त', 'े'], but moving the cursor around, I would be led to believe
> it's 4 "pieces" long only.
>
> Cheers,
> Dorota
>
> [0] https://doc.rust-lang.org/book/second-edition/ch08-02-strings.html

On second thought, perhaps graphemes are actually the relevant thing here...
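
To put numbers on the three ways of counting discussed above, here is a
minimal Python sketch using only the standard library. It is an illustration,
not a proposal for the protocol. The helper name byte_offset is made up for
this example, and the "pieces" count is a crude approximation that glues
combining marks (general category M*) onto the preceding code point. It is
not the UAX #29 grapheme cluster algorithm, which a specialised library
should provide.

```python
import unicodedata

# The Devanagari word from the quoted message, spelled out as its 6 code points.
text = "\u0928\u092e\u0938\u094d\u0924\u0947"

utf8 = text.encode("utf-8")   # the byte sequence a UTF-8 protocol would carry
code_points = list(text)      # Python strings index by code point

# Crude approximation of user-perceived "pieces": a combining mark is
# attached to the piece before it. NOT the real UAX #29 segmentation.
pieces = []
for ch in text:
    if pieces and unicodedata.category(ch).startswith("M"):
        pieces[-1] += ch
    else:
        pieces.append(ch)


def byte_offset(s, cp_index):
    """Hypothetical helper: convert a code point index to a UTF-8 byte offset.

    The conversion has to walk the string, which is the "processing" a
    code point offset requires before it can be applied to the bytes.
    """
    return len(s[:cp_index].encode("utf-8"))


print(len(utf8))             # 18 bytes (each code point here takes 3 bytes)
print(len(code_points))      # 6 code points
print(len(pieces))           # 4 approximate "pieces"
print(byte_offset(text, 4))  # 12
```

With this approximation the word splits into 4 pieces, which matches what the
cursor movement suggested. The same position in the string is index 4 when
counted in code points and offset 12 when counted in UTF-8 bytes.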
_______________________________________________
wayland-devel mailing list
wayland-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/wayland-devel