Re: [PATCHv4 wayland-protocols] text-input: Add v3 of the text-input protocol

2018-05-05 Thread Silvan Jegen
On Sat, May 05, 2018 at 11:09:10AM +0200, Dorota Czaplejewicz wrote:
> On Fri, 4 May 2018 22:32:15 +0200
> Silvan Jegen wrote:
> 
> > On Thu, May 03, 2018 at 10:46:47PM +0200, Dorota Czaplejewicz wrote:
> > > On Thu, 3 May 2018 21:55:40 +0200
> > > Silvan Jegen wrote:
> > >   
> > > > On Thu, May 03, 2018 at 09:22:46PM +0200, Dorota Czaplejewicz wrote:  
> > > > > On Thu, 3 May 2018 20:47:27 +0200
> > > > > Silvan Jegen wrote:
> > > > > 
> > > > > > Hi Dorota
> > > > > > 
> > > > > > Some comments and typo fixes below.
> > > > > > 
> > > > > > On Thu, May 03, 2018 at 05:41:21PM +0200, Dorota Czaplejewicz wrote:
> > > > > > > +  Text is valid UTF-8 encoded, indices and lengths are in code points. If a
> > > > > > > +  grapheme is made up of multiple code points, an index pointing to any of
> > > > > > > +  them should be interpreted as pointing to the first one.
> > > > > > 
> > > > > > That way we make sure we don't put the cursor/anchor between
> > > > > > bytes that belong to the same UTF-8 encoded Unicode code point,
> > > > > > which is nice. It also means that the client has to parse all
> > > > > > the UTF-8 encoded strings into Unicode code points up to the
> > > > > > desired cursor/anchor position on each "preedit_string" event.
> > > > > > For each "delete_surrounding_text" event the client has to parse
> > > > > > the UTF-8 sequences before and after the cursor position up to
> > > > > > the requested Unicode code point.
> > > > > > 
> > > > > > I feel like we are processing the UTF-8 string already in the
> > > > > > input-method. So I am not sure that we should parse it again on
> > > > > > the client side. Parsing it again would also mean that the
> > > > > > client would need to know about UTF-8, which would be nice to
> > > > > > avoid.
> > > > > > 
> > > > > > Thoughts?
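
To make the parsing cost concrete, here is a minimal C sketch of the two
walks being discussed, assuming the string is valid UTF-8 as the protocol
guarantees (the function names are made up for illustration; they are not
part of the patch):

#include <stddef.h>

/* Walk forward from the start of a valid UTF-8 string over n code
 * points and return the resulting byte offset, as a client would do to
 * place the cursor for a "preedit_string" event. Continuation bytes
 * match the bit pattern 10xxxxxx, i.e. (byte & 0xC0) == 0x80. */
size_t codepoint_to_byte_offset(const char *s, size_t n)
{
    size_t i = 0;
    while (n > 0 && s[i] != '\0') {
        i++;                                /* skip the lead byte */
        while ((s[i] & 0xC0) == 0x80)
            i++;                            /* skip continuation bytes */
        n--;
    }
    return i;
}

/* Walk backward over n code points from a known boundary, as a client
 * handling "delete_surrounding_text" would do for the text before the
 * cursor. */
size_t byte_offset_back(const char *s, size_t i, size_t n)
{
    while (n > 0 && i > 0) {
        i--;                                /* step off the boundary */
        while (i > 0 && (s[i] & 0xC0) == 0x80)
            i--;                            /* skip continuation bytes */
        n--;
    }
    return i;
}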
> > > > > 
> > > > > The client needs to know about Unicode, but not necessarily
> > > > > about UTF-8. Specifying code points is actually an advantage
> > > > > here, because byte offsets are inherently expressed relative to
> > > > > UTF-8. By counting in code points, the client's internal
> > > > > representation can be UTF-16 or maybe even something else.
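
A concrete example of why code-point indices stay encoding-neutral (a
standalone sketch, not from the patch):

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *s = "na\xc3\xafve";     /* "naïve": 5 code points */

    printf("%zu bytes\n", strlen(s));   /* prints "6 bytes"; the 'ï'
                                           (U+00EF) takes two bytes */
    /* A cursor placed after the 'ï' sits at code point index 3 no
     * matter the encoding, but at byte offset 4 in UTF-8 and at code
     * unit offset 3 in UTF-16. Only the code-point index means the
     * same thing to a UTF-8 compositor and a UTF-16 client. */
    return 0;
}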
> > > > 
> > > > Maybe I am misunderstanding something, but the protocol specifies
> > > > that the strings are encoded in valid UTF-8 and that the
> > > > cursor/anchor offsets into them are specified in Unicode code
> > > > points. To me that indicates that the application *has to parse*
> > > > the UTF-8 string into Unicode code points when receiving the event;
> > > > otherwise it doesn't know after which Unicode code point to draw
> > > > the cursor. Of course the application can then decide to convert
> > > > the UTF-8 string into another encoding like UTF-16 for internal
> > > > processing (for whatever reason), but that doesn't change the fact
> > > > that it still would have to parse the incoming UTF-8 (and thus know
> > > > about UTF-8).
> > > >   
> > > Can you see any way to avoid parsing UTF-8 in order to draw the
> > > cursor? I tried to come up with one, but even when byte offsets are
> > > specified, I believe that calculating the position of the cursor -
> > > either in pixels or in glyphs - requires fully parsing the input
> > > string.
> > 
> > Yes, I don't think it's avoidable either. You just don't have to do
> > it twice if your text rendering library consumes UTF-8 strings with
> > byte offsets, though. See my response below.
> > 
> > 
> > > > > There's no avoiding the parsing either. What the application cares
> > > > > about is that the cursor falls between glyphs. The application cannot
> > > > > know that in all cases. Unicode allows the same sequence to be
> > > > > displayed in multiple ways (fallback):
> > > > > 
> > > > > https://unicode.org/emoji/charts/emoji-zwj-sequences.html
> > > > > 
> > > > > One could make an argument that byte offsets should never be
> > > > > close to ZWJ characters, but I think this decision is better left
> > > > > to the application, which knows exactly what it is presenting to
> > > > > the user.
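
One of the ZWJ sequences from that chart makes the fallback problem
concrete (data only, for illustration):

/* The "family: man, woman, girl" emoji is a single grapheme built from
 * five code points joined by U+200D ZERO WIDTH JOINER: */
static const unsigned int family_zwj[] =
    { 0x1F468, 0x200D, 0x1F469, 0x200D, 0x1F467 };
/* A renderer with full emoji support draws one glyph, while a fallback
 * renderer may draw three, so only the application knows whether a
 * cursor between the joined code points makes sense. Under the quoted
 * rule, an index pointing at any of the five is interpreted as
 * pointing at the first, U+1F468. */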
> > > > 
> > > > The idea of the previous version of the protocol (from my
> > > > understanding) was to make sure that only valid UTF-8 and valid
> > > > byte offsets into the string (i.e. not falling between the bytes of
> > > > a Unicode code point) would be sent to the client. If you just get
> > > > a byte offset into a UTF-8 encoded string, you trust the sender to
> > > > honor the protocol, and thus you can pass the UTF-8 encoded string
> > > > unprocessed to your text rendering library (provided that the
> > > > library supports UTF-8 strings, which is what I am assuming)
> > > > without having to parse the UTF-8 string into Unicode code points.
> > > > 
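Under that byte-offset scheme, the client's only obligation before
passing the string through would be a cheap boundary check, roughly (a
sketch assuming the compositor honors the protocol and sends valid
UTF-8):

#include <stddef.h>

/* A byte offset is a valid cursor position in UTF-8 exactly when it
 * does not land on a continuation byte (bit pattern 10xxxxxx). */
int is_codepoint_boundary(const char *s, size_t offset)
{
    return ((unsigned char)s[offset] & 0xC0) != 0x80;
}
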
> > > > Of course the Unicode code points will have to be parsed at some
> > > > point if you want to render them. Using byte offsets just lets you
> > > > do that at a later stage if your libraries support UTF-8.

Re: [PATCHv4 wayland-protocols] text-input: Add v3 of the text-input protocol

2018-05-05 Thread Dorota Czaplejewicz
On Fri, 4 May 2018 22:32:15 +0200
Silvan Jegen wrote:

> On Thu, May 03, 2018 at 10:46:47PM +0200, Dorota Czaplejewicz wrote:
> > On Thu, 3 May 2018 21:55:40 +0200
> > Silvan Jegen wrote:
> >   
> > > On Thu, May 03, 2018 at 09:22:46PM +0200, Dorota Czaplejewicz wrote:  
> > > > On Thu, 3 May 2018 20:47:27 +0200
> > > > Silvan Jegen wrote:
> > > > 
> > > > > Hi Dorota
> > > > > 
> > > > > Some comments and typo fixes below.
> > > > > 
> > > > > On Thu, May 03, 2018 at 05:41:21PM +0200, Dorota Czaplejewicz wrote:  
> > > > >   
> > > > > > +  Text is valid UTF-8 encoded, indices and lengths are in code points. If a
> > > > > > +  grapheme is made up of multiple code points, an index pointing to any of
> > > > > > +  them should be interpreted as pointing to the first one.
> > > > > 
> > > > > That way we make sure we don't put the cursor/anchor between
> > > > > bytes that belong to the same UTF-8 encoded Unicode code point,
> > > > > which is nice. It also means that the client has to parse all
> > > > > the UTF-8 encoded strings into Unicode code points up to the
> > > > > desired cursor/anchor position on each "preedit_string" event.
> > > > > For each "delete_surrounding_text" event the client has to parse
> > > > > the UTF-8 sequences before and after the cursor position up to
> > > > > the requested Unicode code point.
> > > > > 
> > > > > I feel like we are processing the UTF-8 string already in the
> > > > > input-method. So I am not sure that we should parse it again on
> > > > > the client side. Parsing it again would also mean that the
> > > > > client would need to know about UTF-8, which would be nice to
> > > > > avoid.
> > > > > 
> > > > > Thoughts?
> > > > 
> > > > The client needs to know about Unicode, but not necessarily
> > > > about UTF-8. Specifying code points is actually an advantage
> > > > here, because byte offsets are inherently expressed relative to
> > > > UTF-8. By counting in code points, the client's internal
> > > > representation can be UTF-16 or maybe even something else.
> > > 
> > > Maybe I am misunderstanding something, but the protocol specifies
> > > that the strings are encoded in valid UTF-8 and that the
> > > cursor/anchor offsets into them are specified in Unicode code
> > > points. To me that indicates that the application *has to parse*
> > > the UTF-8 string into Unicode code points when receiving the event;
> > > otherwise it doesn't know after which Unicode code point to draw
> > > the cursor. Of course the application can then decide to convert
> > > the UTF-8 string into another encoding like UTF-16 for internal
> > > processing (for whatever reason), but that doesn't change the fact
> > > that it still would have to parse the incoming UTF-8 (and thus know
> > > about UTF-8).
> > >   
> > Can you see any way to avoid parsing UTF-8 in order to draw the
> > cursor? I tried to come up with one, but even when byte offsets are
> > specified, I believe that calculating the position of the cursor -
> > either in pixels or in glyphs - requires fully parsing the input
> > string.
> 
> Yes, I don't think it's avoidable either. You just don't have to do
> it twice if your text rendering library consumes UTF-8 strings with
> byte offsets, though. See my response below.
> 
> 
> > > > There's no avoiding the parsing either. What the application cares
> > > > about is that the cursor falls between glyphs. The application cannot
> > > > know that in all cases. Unicode allows the same sequence to be
> > > > displayed in multiple ways (fallback):
> > > > 
> > > > https://unicode.org/emoji/charts/emoji-zwj-sequences.html
> > > > 
> > > > One could make an argument that byte offsets should never be
> > > > close to ZWJ characters, but I think this decision is better left
> > > > to the application, which knows exactly what it is presenting to
> > > > the user.
> > > 
> > > The idea of the previous version of the protocol (from my
> > > understanding) was to make sure that only valid UTF-8 and valid
> > > byte offsets into the string (i.e. not falling between the bytes of
> > > a Unicode code point) would be sent to the client. If you just get
> > > a byte offset into a UTF-8 encoded string, you trust the sender to
> > > honor the protocol, and thus you can pass the UTF-8 encoded string
> > > unprocessed to your text rendering library (provided that the
> > > library supports UTF-8 strings, which is what I am assuming)
> > > without having to parse the UTF-8 string into Unicode code points.
> > > 
> > > Of course the Unicode code points will have to be parsed at some
> > > point if you want to render them. Using byte offsets just lets you
> > > do that at a later stage if your libraries support UTF-8.
> > > 
> > >   
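
Pango is one example of a rendering API that already addresses text by
byte index into UTF-8, so such a client could hand the string and offset
through essentially untouched; a rough sketch (the layout setup is
elided, and the byte offset is assumed to come from the byte-offset
variant of the protocol discussed above):

#include <pango/pango.h>

/* Ask Pango where to draw the cursor for a byte offset into the UTF-8
 * text the layout was given. */
static void place_cursor(PangoLayout *layout, const char *utf8,
                         int byte_offset)
{
    PangoRectangle strong, weak;

    pango_layout_set_text(layout, utf8, -1);  /* -1: NUL-terminated */
    pango_layout_get_cursor_pos(layout, byte_offset, &strong, &weak);
    /* strong.x and strong.y are in Pango units (1/PANGO_SCALE of a
     * pixel); the caller would convert and draw the caret there. */
}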
> > Doesn't that chiefly depend on the kind of text rendering library,
> > though? As far as I understand, passing text to