Hi Joshua

On Sun, May 06, 2018 at 10:11:32PM -0500, Joshua Watt wrote:
> On Sun, May 6, 2018 at 3:37 PM, Dorota Czaplejewicz
> <dorota.czaplejew...@puri.sm> wrote:
> > Unless someone chimes in with more arguments, I'm going to keep
> > using code points in following revisions.
>
> I don't mean to do a drive-by or bikeshed; I do actually have a vested
> interest in this protocol (I've implemented the previous IM protocols
> on Webkit For Wayland). I've really been meaning to try it out, but
> haven't yet had time. I also have quite a bit of experience with
> Unicode (and specifically UTF-8) from my day job, so I wanted to
> chime in...
>
> IMHO, if you are doing UTF-8 (which you should), you should *always*
> specify any offset into the string as a byte offset. I have a few
> reasons for this:
>
> 1. Unicode is *hard*, and it has a lot of terms that people aren't
> always familiar with (code points, glyphs, encodings, and the worst,
> the overloaded term "characters"). "A byte offset into UTF-8" should
> be universally and unambiguously understood.
>
> 2. Even if you specified the cursor offset as an index into a UTF-32
> array of code points, you could *still* end up with the cursor "in
> between" a printed glyph due to combining diacriticals.
This case should be covered by the following paragraph in the protocol
spec:

+ Text is valid UTF-8 encoded, indices and lengths are in code points. If a
+ grapheme is made up of multiple code points, an index pointing to any of
+ them should be interpreted as pointing to the first one.

> 3. Due to UTF-8's self-synchronizing encoding, it is actually very
> easy to determine whether a given byte is the start of a code point or
> in the middle of one (and even to determine *which* byte in the
> sequence it is). Consequently, if you do find that the offset is in
> the middle of a code point, it is pretty trivial to either move to the
> next code point or move back to the beginning of the current one. As
> such, I have always found bytes the more useful offset, because a byte
> offset can more easily be converted to a code point than the other way
> around.

This property of UTF-8 only makes it easier to recover from an issue
you won't have to deal with at all if you specify the offsets in
Unicode code points...

> 4. As more of a "gut feel" sort of thing... A Wayland protocol is a
> pretty well-defined binary API (like a networking API...), and
> specifying in bytes feels more "stable"... Sorry, I really don't have
> solid data to back that up, but I would need a lot of convincing that
> code points were better if someone was proposing throwing this data in
> a UDP packet and blasting it across a network :)

I am afraid gut feels don't count. And I am with you on this :P

Cheers,
Silvan

_______________________________________________
wayland-devel mailing list
wayland-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/wayland-devel