On Sun, May 6, 2018 at 3:37 PM, Dorota Czaplejewicz
<dorota.czaplejew...@puri.sm> wrote:
> On Sat, 5 May 2018 13:37:44 +0200
> Silvan Jegen <s.je...@gmail.com> wrote:
>
>> On Sat, May 05, 2018 at 11:09:10AM +0200, Dorota Czaplejewicz wrote:
>> > On Fri, 4 May 2018 22:32:15 +0200
>> > Silvan Jegen <s.je...@gmail.com> wrote:
>> >
>> > > On Thu, May 03, 2018 at 10:46:47PM +0200, Dorota Czaplejewicz wrote:
>> > > > On Thu, 3 May 2018 21:55:40 +0200
>> > > > Silvan Jegen <s.je...@gmail.com> wrote:
>> > > >
>> > > > > On Thu, May 03, 2018 at 09:22:46PM +0200, Dorota Czaplejewicz wrote:
>> > > > > > On Thu, 3 May 2018 20:47:27 +0200
>> > > > > > Silvan Jegen <s.je...@gmail.com> wrote:
>> > > > > >
>> > > > > > > Hi Dorota
>> > > > > > >
>> > > > > > > Some comments and typo fixes below.
>> > > > > > >
>> > > > > > > On Thu, May 03, 2018 at 05:41:21PM +0200, Dorota Czaplejewicz 
>> > > > > > > wrote:
>> > > > > > > > +      Text is valid UTF-8 encoded, indices and lengths are in
>> > > > > > > > +      code points. If a grapheme is made up of multiple code
>> > > > > > > > +      points, an index pointing to any of them should be
>> > > > > > > > +      interpreted as pointing to the first one.
>> > > > > > >
>> > > > > > > That way we make sure we don't put the cursor/anchor between
>> > > > > > > bytes that belong to the same UTF-8 encoded Unicode code point,
>> > > > > > > which is nice. It also means that the client has to parse all
>> > > > > > > the UTF-8 encoded strings into Unicode code points up to the
>> > > > > > > desired cursor/anchor position on each "preedit_string" event.
>> > > > > > > For each "delete_surrounding_text" event the client has to
>> > > > > > > parse the UTF-8 sequences before and after the cursor position
>> > > > > > > up to the requested Unicode code point.
>> > > > > > >
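>> > > > > > > To make that concrete, the client would need a helper along
>> > > > > > > these lines (a rough sketch in Go; the function name is made
>> > > > > > > up):
>> > > > > > >
>> > > > > > > // codePointToByteOffset converts a code point offset, as the
>> > > > > > > // protocol would send it, into a byte offset into the UTF-8
>> > > > > > > // string s.
>> > > > > > > func codePointToByteOffset(s string, cpOffset int) int {
>> > > > > > >     cp := 0
>> > > > > > >     for byteOff := range s { // start byte of each code point
>> > > > > > >         if cp == cpOffset {
>> > > > > > >             return byteOff
>> > > > > > >         }
>> > > > > > >         cp++
>> > > > > > >     }
>> > > > > > >     return len(s) // offset points to the end of the string
>> > > > > > > }
>> > > > > > >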
>> > > > > > > I feel like we are processing the UTF-8 string already in the
>> > > > > > > input method, so I am not sure that we should parse it again
>> > > > > > > on the client side. Parsing it again would also mean that the
>> > > > > > > client would need to know about UTF-8, which would be nice to
>> > > > > > > avoid.
>> > > > > > >
>> > > > > > > Thoughts?
>> > > > > >
>> > > > > > The client needs to know about Unicode, but not necessarily about
>> > > > > > UTF-8. Specifying code points is actually an advantage here,
>> > > > > > because byte offsets are inherently expressed relative to UTF-8.
>> > > > > > By counting in code points, the client's internal representation
>> > > > > > can be UTF-16 or maybe even something else.
>> > > > >
>> > > > > Maybe I am misunderstanding something, but the protocol specifies
>> > > > > that the strings are valid UTF-8 and that the cursor/anchor offsets
>> > > > > into the strings are given in code points. To me that indicates
>> > > > > that the application *has to parse* the UTF-8 string into code
>> > > > > points when receiving the event; otherwise it doesn't know after
>> > > > > which code point to draw the cursor. Of course the application can
>> > > > > then decide to convert the UTF-8 string into another encoding like
>> > > > > UTF-16 for internal processing (for whatever reason), but that
>> > > > > doesn't change the fact that it still would have to parse the
>> > > > > incoming UTF-8 (and thus know about UTF-8).
>> > > > >
>> > > > Can you see any way to avoid parsing UTF-8 in order to draw the
>> > > > cursor? I tried to come up with a way to do that, but even with
>> > > > specifying byte strings, I believe that calculating the position of
>> > > > the cursor - either in pixels or in glyphs - requires full parsing of
>> > > > the input string.
>> > >
>> > > Yes, I don't think it's avoidable either. You just don't have to do
>> > > it twice if your text rendering library consumes UTF-8 strings with
>> > > byte offsets, though. See my response below.
>> > >
>> > >
>> > > > > > There's no avoiding the parsing either. What the application cares
>> > > > > > about is that the cursor falls between glyphs. The application
>> > > > > > cannot know that in all cases. Unicode allows the same sequence to
>> > > > > > be displayed in multiple ways (fallback):
>> > > > > >
>> > > > > > https://unicode.org/emoji/charts/emoji-zwj-sequences.html
>> > > > > >
>> > > > > > One could make an argument that byte offsets should never be close
>> > > > > > to ZWJ characters, but I think this decision is better left to the
>> > > > > > application, which knows what exactly it is presenting to the user.
>> > > > >
>> > > > > The idea of the previous version of the protocol (from my
>> > > > > understanding) was to make sure that only valid UTF-8 and valid
>> > > > > byte offsets into the string (== not falling between bytes of a
>> > > > > Unicode code point) will be sent to the client. If you just get a
>> > > > > byte offset into a UTF-8 encoded string, you trust the sender to
>> > > > > honor the protocol, and thus you can pass the UTF-8 encoded string
>> > > > > unprocessed to your text rendering library (provided that the
>> > > > > library supports UTF-8 strings, which is what I am assuming)
>> > > > > without having to parse the UTF-8 string into Unicode code points.
>> > > > >
>> > > > > Of course the Unicode code points will have to be parsed at some
>> > > > > point if you want to render them. Using byte offsets just lets you
>> > > > > do that at a later stage if your libraries support UTF-8.
>> > > > >
>> > > > >
>> > > > Doesn't that chiefly depend on the kind of text rendering library,
>> > > > though? As far as I understand, passing text to rendering is necessary
>> > > > to calculate the cursor position. At the same time, it doesn't matter
>> > > > much for the calculations whether the cursor offset is in bytes or
>> > > > code points - the library does the parsing in the last step anyway.
>> > > >
>> > > > I think you mean that if the rendering library accepts byte offsets
>> > > > as the only format, the application would have to parse the UTF-8
>> > > > unnecessarily. I agree with this, but I'm not sure we should optimize
>> > > > for this case. Other libraries may support only code points instead.
>> > > >
>> > > > Did I understand you correctly?
>> > >
>> > > Yes, that's what I meant. I also assumed that no text rendering library
>> > > expects you to pass the string length in code points. I had a look
>> > > and the ones I managed to find expected their lengths in bytes:
>> > >
>> > > * Pango: 
>> > > https://developer.gnome.org/pango/stable/pango-Layout-Objects.html#pango-layout-set-text
>> > > * Harfbuzz: https://harfbuzz.github.io/hello-harfbuzz.html
>> >
>> > I looked a bit deeper and found hb_buffer_add_utf8:
>> >
>> > https://cgit.freedesktop.org/harfbuzz/tree/src/hb-buffer.cc#n1576
>> >
>> > It seems to require both (either?) the number of bytes (for buffer
>> > size) and the number of code points in the same call. In this case, it
>> > doesn't matter how the position information is expressed.
>>
>> Haha, as an API I think that's horrible...
>>
>>
>> > > For those you would need to parse the UTF-8 string yourself first in
>> > > order to find out at which byte position the code point stops where
>> > > the protocol wants you to draw the cursor (if the protocol sends code
>> > > point offsets).
>> > >
>> > > I feel like it would make sense to optimize for the more common case. I
>> > > assume that is the one where you need to pass a length in bytes to the
>> > > text rendering library, not in code points.
>> > >
>> > > Admittedly, I haven't used a lot of text rendering libraries so I would
>> > > very much like to hear more opinions on the issue.
>> > >
>> >
>> > Even if some libraries expect to work with bytes, I see three
>> > reasons not to provide them. Most importantly, I believe that we
>> > should avoid letting people shoot themselves in the foot whenever
>> > possible. Specifying bytes leaves a lot of wiggle room to communicate
>> > invalid state. The supporting reason is that protocols shouldn't be
>> > tied to implementation details.
>>
>> I agree that this is an advantage of using offsets measured in Unicode
>> code points.
>>
>> Still, it worries me to think that for the next 10-20 years people
>> using these protocols will have to parse their UTF-8 strings into code
>> points twice for no good reason...
>>
>>
>> > The least important reason is that handling Unicode is getting better
>> > than it used to be. Taking Python as an example:
>> >
>>
>> That's true to some extent (personally I like Go's string and Unicode
>> handling), but Python is a bad example IMO. Python 3 handles strings
>> this way, while Python 2 handles them in a completely different way:
>>
>> Python 2.7.15 (default, May  1 2018, 20:16:04)
>> [GCC 7.3.1 20180406] on linux2
>> Type "help", "copyright", "credits" or "license" for more information.
>> >>> 'æþ'
>> '\xc3\xa6\xc3\xbe'
>> >>> 'æþ'[1]
>> '\xa6'
>>
>> and I am not sure either of them is easy and efficient to work with.
>>
>>
>> > >>> 'æþ'[1]
>> > 'þ'
>> > >>> len('æþ'.encode('utf-8'))
>> > 4
>> >
>> > Strings are natively indexed by code points. This matches at least
>> > my intuition when I'm asked to place a cursor somewhere inside a
>> > string and report its index.
>>
>> Go expects all strings to be UTF-8 encoded, and they are indexed by
>> byte. You can iterate over strings to get code points (called "runes"
>> there) should you need them:
>>
>> for offset, r := range "æþ" {
>>    fmt.Printf("start byte pos: %d, code point: %c\n", offset, r)
>> }
>>
>> start byte pos: 0, code point: æ
>> start byte pos: 2, code point: þ
>>
>> Using Go's approach you can treat strings as UTF-8 bytes if that's all
>> you care about, while still having an easy way to parse them into code
>> points if you need them.
>>
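>> For example, turning a byte offset back into a code point offset - the
>> unit this protocol revision proposes - is a one-liner with the standard
>> library (a sketch; it assumes byteOff already lies on a code point
>> boundary):
>>
>> import "unicode/utf8"
>>
>> // Code point offset corresponding to a byte offset into s.
>> cpOffset := utf8.RuneCountInString(s[:byteOff])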
>>
>> > In the end, I'm not an expert in that area either - perhaps treating
>> > client side strings as UTF-8 buffers makes sense, but at the moment
>> > I'm still leaning towards the code point abstraction.
>>
>> Someone (™) should probably implement a client making use of the protocol
>> to see what the real-world impact of this protocol change would be.
>>
>> The editor in the Weston project uses Pango for its text layout:
>>
>> https://cgit.freedesktop.org/wayland/weston/tree/clients/editor.c#n824
>>
>> so it would have to parse the UTF-8 string twice. The same is most likely
>> true for all programs using GTK...
>>
>>
>
> I made an attempt to dig deeper, and while I stopped short of becoming this 
> Someone for now, I gathered what I think are some important results.
>
> First, the state of the libraries. I gathered a lot of data, so I'll keep
> this section rather dense. To start with, another contender for the title
> of text layout library - Graphite - and that one uses code points
> exclusively:
>
> https://github.com/silnrsi/graphite/blob/master/include/graphite2/Segment.h 
> `gr_make_seg`
>
> https://github.com/silnrsi/graphite/blob/master/tests/examples/simple.c
>
> Afterwards, I focused on GTK and Qt. As an input method plugin developer, I 
> looked at the IM interfaces and internal data structures they expose. The 
> results were not that clear - no mention of "code points", some references to 
> "bytes", many to "characters" (not "chars"). What is certain is that there's 
> a lot of converting going on behind the scenes anyway. First off, GTK seems 
> to be moving away from bytes, judging by the comments:
>
> gtk 3.22 (`gtkimcontext.c`)
>
> `gtk_im_context_delete_surrounding`
>
>> * Asks the widget that the input context is attached to to delete
>> * characters around the cursor position by emitting the
>> * GtkIMContext::delete_surrounding signal. Note that @offset and @n_chars
>> * are in characters not in bytes which differs from the usage other
>> * places in #GtkIMContext.
>
> `gtk_im_context_get_preedit_string`
>
>> * @cursor_pos: (out): location to store position of cursor (in characters)
>> *              within the preedit string.
>
> `gtk_im_context_get_surrounding`
>
>> * @cursor_index: (out): location to store byte index of the insertion
>> *        cursor within @text.
>
> GtkEntry seems to store things internally as characters.
>
> While GTK using code points internally is not proof of anything, it does
> suggest that there is a reason not to use bytes.
>
> Then, Qt, from https://doc.qt.io/qt-5/qinputmethodevent.html#setCommitString
>
>> replaceLength specifies the number of characters to be replaced
>
> A confirmation that "characters" means "code points" comes from
> https://doc.qt.io/qt-5/qlineedit.html#cursorPosition-prop . The value
> reported when "æþ|" is displayed is 2.
>
> I also spent more time than I should have writing a demo implementation
> of an input method and a client connecting to it to check out the
> proposed interfaces. Predictably, it gave me a lot of trouble on the
> edges between bytes and code points, but I blame that on Rust's scarcity
> of UTF handling functions. The hack is available at
> https://code.puri.sm/dorota.czaplejewicz/impoc
>
> My impression at the moment is that it doesn't matter much how offsets
> within UTF strings are expressed, but that code points reflect slightly
> better what's going on in the GUI toolkits, apart from the benefits
> mentioned in my other emails. There is so much going on behind the
> scenes, and the parsing is so cheap, that it doesn't make sense to worry
> about the computational aspect - better to just make things easier to
> get right.
>
> Unless someone chimes in with more arguments, I'm going to keep using
> code points in the following revisions.

I don't mean to do a drive-by or bikeshed; I do actually have a vested
interest in this protocol (I've implemented the previous IM protocols
on WebKit for Wayland). I've really been meaning to try it out, but
haven't yet had the time. I also have quite a bit of experience with
Unicode (and specifically UTF-8) from my day job, so I wanted to
chime in...

IMHO, if you are doing UTF-8 (which you should), you should *always*
specify any offset in the string as a byte offset. I have a few
reasons for this justification:
 1. Unicode is *hard*, and it has a lot of terms that people aren't
always familiar with (code points, glyphs, encodings, and that most
overloaded of terms, "characters"). "A byte offset in UTF-8" should be
universally and unambiguously understood.
 2. Even if you specified the cursor offset as an index into a UTF-32
array of code points, you could *still* end up with the cursor "in
between" a printed glyph due to combining diacriticals.
 3. Because UTF-8 is a self-synchronizing encoding, it is actually very
easy to determine whether a given byte is the start of a code point or
in the middle of one (and even to determine *which* byte in the
sequence it is). Consequently, if you do find that the offset is in the
middle of a code point, it is pretty trivial to either move to the next
code point or move back to the beginning of the current one (see the
sketch after this list). As such, I have always found a byte offset
more useful, because it can more easily be converted to a code point
offset than the other way around.
 4. More of a "gut feel" sort of thing... A Wayland protocol is a
pretty well-defined binary API (like a networking API...), and
specifying offsets in bytes feels more "stable". Sorry, I really don't
have solid data to back that up, but I would need a lot of convincing
that code points were better if someone were proposing throwing this
data in a UDP packet and blasting it across a network :)
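
To illustrate point 3, here is a quick sketch (in Go, but the bit
twiddling is the same in any language; the function name is made up) of
snapping an arbitrary byte offset back to the start of its code point:

// snapToCodePointStart walks backwards past UTF-8 continuation bytes,
// which are exactly the bytes of the form 0b10xxxxxx.
func snapToCodePointStart(s []byte, off int) int {
    for off > 0 && off < len(s) && s[off]&0xC0 == 0x80 {
        off--
    }
    return off
}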

Thanks,
Joshua Watt

>
> Cheers,
> Dorota
>