Re: [PATCHv4 wayland-protocols] text-input: Add v3 of the text-input protocol
On Sat, May 05, 2018 at 11:09:10AM +0200, Dorota Czaplejewicz wrote:
> On Fri, 4 May 2018 22:32:15 +0200
> Silvan Jegen wrote:
>
> > On Thu, May 03, 2018 at 10:46:47PM +0200, Dorota Czaplejewicz wrote:
> > > On Thu, 3 May 2018 21:55:40 +0200
> > > Silvan Jegen wrote:
> > >
> > > > On Thu, May 03, 2018 at 09:22:46PM +0200, Dorota Czaplejewicz wrote:
> > > > > On Thu, 3 May 2018 20:47:27 +0200
> > > > > Silvan Jegen wrote:
> > > > >
> > > > > > Hi Dorota
> > > > > >
> > > > > > Some comments and typo fixes below.
> > > > > >
> > > > > > On Thu, May 03, 2018 at 05:41:21PM +0200, Dorota Czaplejewicz wrote:
> > > > > > > +        Text is valid UTF-8 encoded, indices and lengths are in code points. If a
> > > > > > > +        grapheme is made up of multiple code points, an index pointing to any of
> > > > > > > +        them should be interpreted as pointing to the first one.
> > > > > >
> > > > > > That way we make sure we don't put the cursor/anchor between bytes
> > > > > > that belong to the same UTF-8 encoded Unicode code point, which is
> > > > > > nice. It also means that the client has to parse all the UTF-8
> > > > > > encoded strings into Unicode code points up to the desired
> > > > > > cursor/anchor position on each "preedit_string" event. For each
> > > > > > "delete_surrounding_text" event the client has to parse the UTF-8
> > > > > > sequences before and after the cursor position up to the requested
> > > > > > Unicode code point.
> > > > > >
> > > > > > I feel like we are processing the UTF-8 string already in the
> > > > > > input-method, so I am not sure that we should parse it again on
> > > > > > the client side. Parsing it again would also mean that the client
> > > > > > would need to know about UTF-8, which would be nice to avoid.
> > > > > >
> > > > > > Thoughts?
> > > > >
> > > > > The client needs to know about Unicode, but not necessarily about
> > > > > UTF-8. Specifying code points is actually an advantage here, because
> > > > > byte offsets are inherently expressed relative to UTF-8. By counting
> > > > > with code points, the client's internal representation can be UTF-16
> > > > > or maybe even something else.
> > > >
> > > > Maybe I am misunderstanding something, but the protocol specifies
> > > > that the strings are valid UTF-8 encoded and that the cursor/anchor
> > > > offsets into the strings are specified in Unicode code points. To me
> > > > that indicates that the application *has to parse* the UTF-8 string
> > > > into Unicode code points when receiving the event; otherwise it
> > > > doesn't know after which Unicode code point to draw the cursor. Of
> > > > course the application can then decide to convert the UTF-8 string
> > > > into another encoding like UTF-16 for internal processing (for
> > > > whatever reason), but that doesn't change the fact that it still
> > > > would have to parse the incoming UTF-8 (and thus know about UTF-8).
> > >
> > > Can you see any way to avoid parsing UTF-8 in order to draw the
> > > cursor? I tried to come up with a way to do that, but even with
> > > specifying byte strings, I believe that calculating the position of
> > > the cursor - either in pixels or in glyphs - requires full parsing of
> > > the input string.
> >
> > Yes, I don't think it's avoidable either. You just don't have to do
> > it twice if your text rendering library consumes UTF-8 strings with
> > byte offsets, though. See my response below.
> >
> > > > > There's no avoiding the parsing either. What the application cares
> > > > > about is that the cursor falls between glyphs. The application
> > > > > cannot know that in all cases. Unicode allows the same sequence to
> > > > > be displayed in multiple ways (fallback):
> > > > >
> > > > > https://unicode.org/emoji/charts/emoji-zwj-sequences.html
> > > > >
> > > > > One could make an argument that byte offsets should never be close
> > > > > to ZWJ characters, but I think this decision is better left to the
> > > > > application, which knows what exactly it is presenting to the user.
> > > >
> > > > The idea of the previous version of the protocol (from my
> > > > understanding) was to make sure that only valid UTF-8 and valid byte
> > > > offsets (== not falling between bytes of a Unicode code point) into
> > > > the string will be sent to the client. If you just get a byte offset
> > > > into a UTF-8 encoded string, you trust the sender to honor the
> > > > protocol and thus you can just pass the UTF-8 encoded string
> > > > unprocessed to your text rendering library (provided that the
> > > > library supports UTF-8 strings, which is what I am assuming) without
> > > > having to parse the UTF-8 string into Unicode code points.
> > > >
> > > > Of course the Unicode code points will have to be parsed at some
> > > > point if you want to render them.
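The per-event parsing discussed above - walking the UTF-8 string and counting code points until the cursor/anchor index is reached - is linear in the offset and relies only on the shape of UTF-8 itself. A minimal sketch in Python (illustrative only, not part of the patch; the function name is made up):

```python
def codepoint_to_byte_offset(utf8_bytes: bytes, cp_index: int) -> int:
    """Return the byte offset of the cp_index-th code point.

    UTF-8 continuation bytes always match 0b10xxxxxx, so counting the
    bytes that are NOT continuation bytes counts code points.
    """
    seen = 0
    for i, b in enumerate(utf8_bytes):
        if (b & 0xC0) != 0x80:  # lead byte: starts a new code point
            if seen == cp_index:
                return i
            seen += 1
    if seen == cp_index:        # index just past the last code point
        return len(utf8_bytes)
    raise IndexError("code point index out of range")
```

For example, in "héllo" encoded as UTF-8, code point index 2 (the first "l") sits at byte offset 3, because "é" occupies two bytes.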
Re: [PATCHv4 wayland-protocols] text-input: Add v3 of the text-input protocol
On Fri, 4 May 2018 22:32:15 +0200 Silvan Jegen wrote:
> On Thu, May 03, 2018 at 10:46:47PM +0200, Dorota Czaplejewicz wrote:
> > On Thu, 3 May 2018 21:55:40 +0200
> > Silvan Jegen wrote:
> >
> > > On Thu, May 03, 2018 at 09:22:46PM +0200, Dorota Czaplejewicz wrote:
> > > > On Thu, 3 May 2018 20:47:27 +0200
> > > > Silvan Jegen wrote:
> > > >
> > > > > Hi Dorota
> > > > >
> > > > > Some comments and typo fixes below.
> > > > >
> > > > > On Thu, May 03, 2018 at 05:41:21PM +0200, Dorota Czaplejewicz wrote:
> > > > > > +        Text is valid UTF-8 encoded, indices and lengths are in code points. If a
> > > > > > +        grapheme is made up of multiple code points, an index pointing to any of
> > > > > > +        them should be interpreted as pointing to the first one.
> > > > >
> > > > > That way we make sure we don't put the cursor/anchor between bytes
> > > > > that belong to the same UTF-8 encoded Unicode code point, which is
> > > > > nice. It also means that the client has to parse all the UTF-8
> > > > > encoded strings into Unicode code points up to the desired
> > > > > cursor/anchor position on each "preedit_string" event. For each
> > > > > "delete_surrounding_text" event the client has to parse the UTF-8
> > > > > sequences before and after the cursor position up to the requested
> > > > > Unicode code point.
> > > > >
> > > > > I feel like we are processing the UTF-8 string already in the
> > > > > input-method, so I am not sure that we should parse it again on
> > > > > the client side. Parsing it again would also mean that the client
> > > > > would need to know about UTF-8, which would be nice to avoid.
> > > > >
> > > > > Thoughts?
> > > >
> > > > The client needs to know about Unicode, but not necessarily about
> > > > UTF-8. Specifying code points is actually an advantage here, because
> > > > byte offsets are inherently expressed relative to UTF-8. By counting
> > > > with code points, the client's internal representation can be UTF-16
> > > > or maybe even something else.
> > >
> > > Maybe I am misunderstanding something, but the protocol specifies
> > > that the strings are valid UTF-8 encoded and that the cursor/anchor
> > > offsets into the strings are specified in Unicode code points. To me
> > > that indicates that the application *has to parse* the UTF-8 string
> > > into Unicode code points when receiving the event; otherwise it
> > > doesn't know after which Unicode code point to draw the cursor. Of
> > > course the application can then decide to convert the UTF-8 string
> > > into another encoding like UTF-16 for internal processing (for
> > > whatever reason), but that doesn't change the fact that it still
> > > would have to parse the incoming UTF-8 (and thus know about UTF-8).
> >
> > Can you see any way to avoid parsing UTF-8 in order to draw the
> > cursor? I tried to come up with a way to do that, but even with
> > specifying byte strings, I believe that calculating the position of
> > the cursor - either in pixels or in glyphs - requires full parsing of
> > the input string.
>
> Yes, I don't think it's avoidable either. You just don't have to do
> it twice if your text rendering library consumes UTF-8 strings with
> byte offsets, though. See my response below.
>
> > > > There's no avoiding the parsing either. What the application cares
> > > > about is that the cursor falls between glyphs. The application
> > > > cannot know that in all cases. Unicode allows the same sequence to
> > > > be displayed in multiple ways (fallback):
> > > >
> > > > https://unicode.org/emoji/charts/emoji-zwj-sequences.html
> > > >
> > > > One could make an argument that byte offsets should never be close
> > > > to ZWJ characters, but I think this decision is better left to the
> > > > application, which knows what exactly it is presenting to the user.
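The ZWJ sequences linked above are exactly the case where graphemes, code points, and bytes all disagree, which is why the spec text maps an index inside a grapheme to its first code point. A small Python illustration (Python string indexing counts code points, so it shows the mismatch directly):

```python
# One visible "family" glyph, built from several code points joined by
# U+200D ZERO WIDTH JOINER (an entry from the emoji-zwj-sequences chart).
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

print(len(family))                  # 5 code points, but 1 grapheme
print(len(family.encode("utf-8")))  # 18 bytes in UTF-8 (4+3+4+3+4)
```

Under the quoted spec wording, an index pointing at any of those 5 code points should be treated as pointing at code point 0, since placing a cursor mid-sequence may split the glyph back into its fallback rendering.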
> > > The idea of the previous version of the protocol (from my
> > > understanding) was to make sure that only valid UTF-8 and valid byte
> > > offsets (== not falling between bytes of a Unicode code point) into
> > > the string will be sent to the client. If you just get a byte offset
> > > into a UTF-8 encoded string, you trust the sender to honor the
> > > protocol and thus you can just pass the UTF-8 encoded string
> > > unprocessed to your text rendering library (provided that the
> > > library supports UTF-8 strings, which is what I am assuming) without
> > > having to parse the UTF-8 string into Unicode code points.
> > >
> > > Of course the Unicode code points will have to be parsed at some
> > > point if you want to render them. Using byte offsets just lets you
> > > do that at a later stage if your libraries support UTF-8.
> >
> > Doesn't that chiefly depend on the kind of text rendering library,
> > though? As far as I understand, passing text to
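The guarantee described above for the previous protocol version - byte offsets never landing between the bytes of one code point - is also cheap for a receiver to verify rather than trust, because UTF-8 continuation bytes are self-identifying. A hypothetical validation helper (a sketch, not anything from the patch):

```python
def is_codepoint_boundary(utf8_bytes: bytes, offset: int) -> bool:
    """True if `offset` does not fall between bytes of one code point.

    Continuation bytes in UTF-8 always match 0b10xxxxxx, so any offset
    whose byte is not a continuation byte is a valid boundary.
    """
    if offset < 0 or offset > len(utf8_bytes):
        return False
    if offset == len(utf8_bytes):   # end of string is always a boundary
        return True
    return (utf8_bytes[offset] & 0xC0) != 0x80
```

For "aé" encoded as UTF-8 (bytes 0x61 0xC3 0xA9), offsets 0, 1, and 3 are boundaries, while offset 2 falls inside the two-byte "é" and would be rejected.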