Re: [PATCHv4 wayland-protocols] text-input: Add v3 of the text-input protocol

2018-07-23 Thread Dorota Czaplejewicz
Hi Carlos,

thanks for reviewing!

On Tue, 17 Jul 2018 19:18:36 +0200
Carlos Garnacho  wrote:

> Hi!,
> 
> (Way way late, trying to revive the conversation...)
> 
> On Thu, May 3, 2018 at 9:22 PM, Dorota Czaplejewicz
>  wrote:
> > On Thu, 3 May 2018 20:47:27 +0200
> > Silvan Jegen  wrote:
> >  
> >> Hi Dorota
> >>
> >> Some comments and typo fixes below.
> >>
> >> On Thu, May 03, 2018 at 05:41:21PM +0200, Dorota Czaplejewicz wrote:  
> >> > This new protocol description is a simplification over v2.
> >> >
> >> > - All pre-edit text styling is gone.
> >> > - Pre-edit cursor can span characters.
> >> > - No events regarding input panel (OSK) state nor covered rectangle.
> >> >   Compositors are still free to handle situations where the keyboard
> >> >   focus rectangle is covered by the input panel.
> >> > - No set_preferred_language request for clients.
> >> > - There is no event to send keysyms. Compositors can use wl_keyboard
> >> >   interface instead.
> >> > - All state is double-buffered, with specified state.
> >> > - Use Unicode codepoints to measure strings.
> >> >
> >> > Signed-off-by: Dorota Czaplejewicz 
> >> > Signed-off-by: Carlos Garnacho 
> >> > ---
> >> > This is the next update coming from Purism to perfect the text input 
> >> > protocol.
> >> >
> >> > The following changes were added on top of PATCHv3:
> >> >
> >> > - Fixed whitespaces.
> >> > - Removed enable flags - the same information can be gathered from the 
> >> > first requests after enter.
> >> > - Changed offsets inside UTF-8 strings to use Unicode character counts 
> >> > in order to remove the possibility of communicating invalid state.
> >> > - Specified the exact lifetime of double-buffered state, and initial 
> >> > values.
> >> > - Made changes requested by the IM double-buffered.
> >> >
> >> > Some questions remain open. One is: how to specify how much text to 
> >> > capture in set_surrounding_text, and how often to update?  
> 
> IMHO the only reason to state it here is that it's more likely that a
> lazy implementation will try to squeeze a full book here, than eg. an
> application setting an insanely long title. But certainly other
> messages across protocols may hit this limit (the long title issue
> wasn't made up :).
> 
> As for how much, I think it ultimately depends on the IM behind. Text
> correction probably just wants the current word, any sort of
> prediction will probably require phrases to paragraphs, char
> composition can probably do without. Sounds like this could be some
> sort of hint, but I don't think IMs can tell you today how much text
> they want...
> 
> >> >
> >> > A possible change that I decided against for now is to replace 
> >> > enable/disable events by create/destroy of a new object, which would 
> >> > make more state lifetimes encoded in the protocol.
> >> >
> >> > After reading a blog post on fcitx [0], I got the impression that 
> >> > letting the compositor know some persistent ID of a text edit instance 
> >> > could be useful, however I'm not sure what the use cases are.
> >> >
> >> > As always, I'm happy to hear feedback.
> >> >
> >> > Cheers,
> >> > Dorota Czaplejewicz
> >> >
> >> > [0] 
> >> > https://www.csslayer.info/wordpress/fcitx-dev/gaps-between-wayland-and-fcitx-or-all-input-methods/
> >> >
> >> >  Makefile.am|   1 +
> >> >  unstable/text-input/text-input-unstable-v3.xml | 362 
> >> > +
> >> >  2 files changed, 363 insertions(+)
> >> >  create mode 100644 unstable/text-input/text-input-unstable-v3.xml
> >> >
> >> > diff --git a/Makefile.am b/Makefile.am
> >> > index 4b9a901..86d7ca9 100644
> >> > --- a/Makefile.am
> >> > +++ b/Makefile.am
> >> > @@ -3,6 +3,7 @@ unstable_protocols = \
> >> >     unstable/fullscreen-shell/fullscreen-shell-unstable-v1.xml \
> >> >     unstable/linux-dmabuf/linux-dmabuf-unstable-v1.xml \
> >> >     unstable/text-input/text-input-unstable-v1.xml \
> >> > +   unstable/text-input/text-input-unstable-v3.xml \
> >> >     unstable/input-method/input-method-unstable-v1.xml \
> >> >     unstable/xdg-shell/xdg-shell-unstable-v5.xml \
> >> >     unstable/xdg-shell/xdg-shell-unstable-v6.xml \
> >> > diff --git a/unstable/text-input/text-input-unstable-v3.xml 
> >> > b/unstable/text-input/text-input-unstable-v3.xml
> >> > new file mode 100644
> >> > index 000..ed5204f
> >> > --- /dev/null
> >> > +++ b/unstable/text-input/text-input-unstable-v3.xml
> >> > @@ -0,0 +1,362 @@
> >> > +
> >> > +
> >> > +
> >> > +  
> >> > +Copyright © 2012, 2013 Intel Corporation
> >> > +Copyright © 2015, 2016 Jan Arne Petersen
> >> > +Copyright © 2017, 2018 Red Hat, Inc.
> >> > +Copyright © 2018 Purism SPC
> >> > +
> 

Re: [PATCHv4 wayland-protocols] text-input: Add v3 of the text-input protocol

2018-07-17 Thread Carlos Garnacho
Hi!,

(Way way late, trying to revive the conversation...)

On Thu, May 3, 2018 at 9:22 PM, Dorota Czaplejewicz
 wrote:
> On Thu, 3 May 2018 20:47:27 +0200
> Silvan Jegen  wrote:
>
>> Hi Dorota
>>
>> Some comments and typo fixes below.
>>
>> On Thu, May 03, 2018 at 05:41:21PM +0200, Dorota Czaplejewicz wrote:
>> > This new protocol description is a simplification over v2.
>> >
>> > - All pre-edit text styling is gone.
>> > - Pre-edit cursor can span characters.
>> > - No events regarding input panel (OSK) state nor covered rectangle.
>> >   Compositors are still free to handle situations where the keyboard
>> >   focus rectangle is covered by the input panel.
>> > - No set_preferred_language request for clients.
>> > - There is no event to send keysyms. Compositors can use wl_keyboard
>> >   interface instead.
>> > - All state is double-buffered, with specified state.
>> > - Use Unicode codepoints to measure strings.
>> >
>> > Signed-off-by: Dorota Czaplejewicz 
>> > Signed-off-by: Carlos Garnacho 
>> > ---
>> > This is the next update coming from Purism to perfect the text input 
>> > protocol.
>> >
>> > The following changes were added on top of PATCHv3:
>> >
>> > - Fixed whitespaces.
>> > - Removed enable flags - the same information can be gathered from the 
>> > first requests after enter.
>> > - Changed offsets inside UTF-8 strings to use Unicode character counts in 
>> > order to remove the possibility of communicating invalid state.
>> > - Specified the exact lifetime of double-buffered state, and initial 
>> > values.
>> > - Made changes requested by the IM double-buffered.
>> >
>> > Some questions remain open. One is: how to specify how much text to 
>> > capture in set_surrounding_text, and how often to update?

IMHO the only reason to state it here is that it's more likely that a
lazy implementation will try to squeeze a full book here, than eg. an
application setting an insanely long title. But certainly other
messages across protocols may hit this limit (the long title issue
wasn't made up :).

As for how much, I think it ultimately depends on the IM behind. Text
correction probably just wants the current word, any sort of
prediction will probably require phrases to paragraphs, char
composition can probably do without. Sounds like this could be some
sort of hint, but I don't think IMs can tell you today how much text
they want...
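
To make the "how much" question concrete, here is a minimal sketch of how
a client might cap the text it passes to set_surrounding_text. The helper
name clamp_surrounding_text and the max_bytes budget are hypothetical and
not part of the protocol; the sketch only assumes that the cursor is known
as a code point index and that the budget is counted in UTF-8 bytes.

```python
def clamp_surrounding_text(text, cursor, max_bytes):
    """Return (excerpt, cursor_in_excerpt) with len(excerpt UTF-8) <= max_bytes.

    `cursor` is a code point index into `text`. The excerpt grows outward
    from the cursor, one whole code point at a time, so it never splits a
    UTF-8 sequence.
    """
    start = end = cursor
    used = 0
    grew = True
    while grew:
        grew = False
        # Try to take one more code point to the left of the excerpt.
        if start > 0:
            cost = len(text[start - 1].encode("utf-8"))
            if used + cost <= max_bytes:
                start -= 1
                used += cost
                grew = True
        # Try to take one more code point to the right of the excerpt.
        if end < len(text):
            cost = len(text[end].encode("utf-8"))
            if used + cost <= max_bytes:
                end += 1
                used += cost
                grew = True
    return text[start:end], cursor - start


print(clamp_surrounding_text("héllo wörld", 5, 4))
```

A word-correcting IM could get by with a small budget, while a predictive
one would want a larger one; the budget is the kind of hint discussed
above.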

>> >
>> > A possible change that I decided against for now is to replace 
>> > enable/disable events by create/destroy of a new object, which would make 
>> > more state lifetimes encoded in the protocol.
>> >
>> > After reading a blog post on fcitx [0], I got the impression that letting 
>> > the compositor know some persistent ID of a text edit instance could be 
>> > useful, however I'm not sure what the use cases are.
>> >
>> > As always, I'm happy to hear feedback.
>> >
>> > Cheers,
>> > Dorota Czaplejewicz
>> >
>> > [0] 
>> > https://www.csslayer.info/wordpress/fcitx-dev/gaps-between-wayland-and-fcitx-or-all-input-methods/
>> >
>> >  Makefile.am|   1 +
>> >  unstable/text-input/text-input-unstable-v3.xml | 362 
>> > +
>> >  2 files changed, 363 insertions(+)
>> >  create mode 100644 unstable/text-input/text-input-unstable-v3.xml
>> >
>> > diff --git a/Makefile.am b/Makefile.am
>> > index 4b9a901..86d7ca9 100644
>> > --- a/Makefile.am
>> > +++ b/Makefile.am
>> > @@ -3,6 +3,7 @@ unstable_protocols = \
>> >     unstable/fullscreen-shell/fullscreen-shell-unstable-v1.xml \
>> >     unstable/linux-dmabuf/linux-dmabuf-unstable-v1.xml \
>> >     unstable/text-input/text-input-unstable-v1.xml \
>> > +   unstable/text-input/text-input-unstable-v3.xml \
>> >     unstable/input-method/input-method-unstable-v1.xml \
>> >     unstable/xdg-shell/xdg-shell-unstable-v5.xml \
>> >     unstable/xdg-shell/xdg-shell-unstable-v6.xml \
>> > diff --git a/unstable/text-input/text-input-unstable-v3.xml 
>> > b/unstable/text-input/text-input-unstable-v3.xml
>> > new file mode 100644
>> > index 000..ed5204f
>> > --- /dev/null
>> > +++ b/unstable/text-input/text-input-unstable-v3.xml
>> > @@ -0,0 +1,362 @@
>> > +
>> > +
>> > +
>> > +  
>> > +Copyright © 2012, 2013 Intel Corporation
>> > +Copyright © 2015, 2016 Jan Arne Petersen
>> > +Copyright © 2017, 2018 Red Hat, Inc.
>> > +Copyright © 2018 Purism SPC
>> > +
>> > +Permission to use, copy, modify, distribute, and sell this
>> > +software and its documentation for any purpose is hereby granted
>> > +without fee, provided that the above copyright notice appear in
>> > +all copies and that both that copyright notice and this permission
>> > +notice appear in supporting 

Re: [PATCHv4 wayland-protocols] text-input: Add v3 of the text-input protocol

2018-05-17 Thread Dorota Czaplejewicz
On Thu, 17 May 2018 18:05:34 +0100
Daniel Stone  wrote:

> Hi Dorota,
> 
> On 3 May 2018 at 16:41, Dorota Czaplejewicz  
> wrote:
> > - There is no event to send keysyms. Compositors can use wl_keyboard
> >   interface instead.  
> 
> The reason we explicitly chose to have a keysym (really, 'Unicode
> codepoint') event, is to support characters which don't appear in any
> keymap. As a trivial example, emoji keyboards will want to send
> symbols which appear in no sane keymap. Similarly, CJK input methods
> may offer streams of characters pre-composed from component runs; it
> is not practical to insert the entire CJK unicode space into a keymap.
> 
> Cheers,
> Daniel


Hi Daniel,

I think that anyone wanting to support inserting arbitrary Unicode characters 
should use the text composition requests instead (commit_string and friends). 
Input methods, especially CJK ones, will make use of that functionality anyway. 
If removing keysyms makes something impossible, I would rather fix the text 
composition portion of the protocol.

Cheers,
Dorota


___
wayland-devel mailing list
wayland-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/wayland-devel


Re: [PATCHv4 wayland-protocols] text-input: Add v3 of the text-input protocol

2018-05-10 Thread Silvan Jegen
On Thu, May 10, 2018 at 11:46:32AM +0200, Dorota Czaplejewicz wrote:
> On Thu, 10 May 2018 11:43:12 +0200
> Dorota Czaplejewicz  wrote:
> 
> > On Tue, 08 May 2018 07:07:24 +
> > Silvan Jegen  wrote:
> > 
> > > On Mon, May 7, 2018 at 5:11 AM Joshua Watt  wrote:  
> > > > IMHO, if you are doing UTF-8 (which you should), you should *always*
> > > > specify any offset in the string as a byte offset. I have a few
> > > > reasons for this justification:
> > > 
> > > I agree with this as well. I thought some more about how to spell out my
> > > gut feeling on this matter in more technical terms.
> > > 
> > > UTF-8 is a byte (sequence) representation of Unicode code points. This
> > > indicates to me that an offset within an UTF-8-encoded string should also
> > > > be given in bytes. Specifying the offset in Unicode code points mixes the
> > > abstraction of the Unicode code point with (one of) its representations as
> > > a byte sequence. This is reflected in the fact that an offset in Unicode
> > > code points is not applicable to the UTF-8 string without first processing
> > > the string.
> > > 
> > > Unicode code points do not give us that much either since what we most
> > > likely want are grapheme clusters anyway (which, like any more advanced
> > > Unicode processing, should be handled by a specialised library):
> > > http://utf8everywhere.org/#myth.strlen
> > > 
> > > 
> > > Cheers,
> > > 
> > > Silvan  
> > 
> > This message made me feel obliged to turn my own gut feeling into
> > words. This is not to be construed as an argument, but more of an
> > explanation.
> > 
> > I view wayland protocols as rather high level: their responsibility
> > is to specify the type and the purpose of the data they are
> > transporting. In this case, the data is a Unicode string, and the
> > purpose is display. Or, the data is a number and the purpose is
> > indexing.
> > 
> > I think that when a protocol starts to specify the type and purpose,
> > it can no longer be thought of as high level. In this view, indexing a
> > Unicode string in terms of bytes would be akin to indexing any other
> > vector of Foo in bytes. (I didn't actually check if there is any
> > other vector, or bytes type available in wayland).
> > 
> > As you noted, there is some mixing between abstraction levels in
> > the protocol. Hardcoding that it's not *just* Unicode, but also the
> > particular encoding (UTF-8) eliminates problems with byte indexing
> > we would have encountered if we decided to use things like Punycode
> > (München => Mnchen-3ya). Knowing that it's always UTF-8 allows the
> > protocol to use a tailored indexing scheme. While I consider this a
> > layer-breaking hack, nevertheless, this property partially counters
> > the above reasoning.
> > 
> > * * *
> > 
> > To be honest, neither Unicode code points nor graphemes nor clusters
> > are what we're truly looking for here. To understand what I mean, I
> > recommend playing with this grapheme cluster:
> > 
> > नमस्ते
> > 
> > According to the Rust book [0], it's composed of 6 code points:
> > ['न', 'म', 'स', '्', 'त', 'े'], but moving the cursor
> > around, I would be led to believe it's 4 "pieces" long only.
> > 
> > Cheers,
> > Dorota
> > 
> > [0] https://doc.rust-lang.org/book/second-edition/ch08-02-strings.html
> 
> On second thought, perhaps graphemes are actually the relevant thing here...

Yes, that's also mentioned in the Rust book:

https://doc.rust-lang.org/book/second-edition/ch08-02-strings.html#bytes-and-scalar-values-and-grapheme-clusters-oh-my

and what I mentioned in my mail.

I agree with what is mentioned in http://utf8everywhere.org/#myth.strlen
which is that Unicode code points are almost never what people making
use of the protocol would want:

"Yet, the number of code points in it is irrelevant to almost any software
engineering task, with perhaps the only exception of converting the
string to UTF-32"

So instead just specifying a byte offset (thus not mixing layers of
abstraction) and leaving more specialized Unicode handling (if desired by
the client) to specialized libraries seems like the best way to go.
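
One concern with byte offsets is that they can point into the middle of
a multi-byte sequence. As a minimal sketch (the function name
is_code_point_boundary is made up for illustration, and the input is
assumed to be valid UTF-8), a receiver can reject such offsets without
decoding the string, because continuation bytes always have the bit
pattern 10xxxxxx:

```python
def is_code_point_boundary(data: bytes, offset: int) -> bool:
    """Return True if `offset` falls between two code points of valid UTF-8 `data`."""
    if offset < 0 or offset > len(data):
        return False
    if offset == len(data):
        # The end of the string is always a valid cursor position.
        return True
    # A byte of the form 10xxxxxx continues a sequence started earlier,
    # so an offset pointing at it would split a code point.
    return (data[offset] & 0xC0) != 0x80


encoded = "aé".encode("utf-8")  # b'a\xc3\xa9'
print([is_code_point_boundary(encoded, i) for i in range(4)])
```

This only checks code point boundaries; it says nothing about grapheme
clusters, which still need a dedicated library.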


Cheers,

Silvan




Re: [PATCHv4 wayland-protocols] text-input: Add v3 of the text-input protocol

2018-05-10 Thread Dorota Czaplejewicz
On Thu, 10 May 2018 11:43:12 +0200
Dorota Czaplejewicz  wrote:

> On Tue, 08 May 2018 07:07:24 +
> Silvan Jegen  wrote:
> 
> > On Mon, May 7, 2018 at 5:11 AM Joshua Watt  wrote:  
> > > IMHO, if you are doing UTF-8 (which you should), you should *always*
> > > specify any offset in the string as a byte offset. I have a few
> > > reasons for this justification:
> > 
> > I agree with this as well. I thought some more about how to spell out my
> > gut feeling on this matter in more technical terms.
> > 
> > UTF-8 is a byte (sequence) representation of Unicode code points. This
> > indicates to me that an offset within an UTF-8-encoded string should also
> > > be given in bytes. Specifying the offset in Unicode code points mixes the
> > abstraction of the Unicode code point with (one of) its representations as
> > a byte sequence. This is reflected in the fact that an offset in Unicode
> > code points is not applicable to the UTF-8 string without first processing
> > the string.
> > 
> > Unicode code points do not give us that much either since what we most
> > likely want are grapheme clusters anyway (which, like any more advanced
> > Unicode processing, should be handled by a specialised library):
> > http://utf8everywhere.org/#myth.strlen
> > 
> > 
> > Cheers,
> > 
> > Silvan  
> 
> This message made me feel obliged to turn my own gut feeling into words. This 
> is not to be construed as an argument, but more of an explanation.
> 
> I view wayland protocols as rather high level: their responsibility is to 
> specify the type and the purpose of the data they are transporting. In this 
> case, the data is a Unicode string, and the purpose is display. Or, the data 
> is a number and the purpose is indexing.
> 
> I think that when a protocol starts to specify the type and purpose, it can 
> no longer be thought of as high level. In this view, indexing a Unicode string 
> in terms of bytes would be akin to indexing any other vector of Foo in bytes. 
> (I didn't actually check if there is any other vector, or bytes type 
> available in wayland).
> 
> As you noted, there is some mixing between abstraction levels in the 
> protocol. Hardcoding that it's not *just* Unicode, but also the particular 
> encoding (UTF-8) eliminates problems with byte indexing we would have 
> encountered if we decided to use things like Punycode (München => 
> Mnchen-3ya). Knowing that it's always UTF-8 allows the protocol to use a 
> tailored indexing scheme. While I consider this a layer-breaking hack, 
> nevertheless, this property partially counters the above reasoning.
> 
> * * *
> 
> To be honest, neither Unicode code points nor graphemes nor clusters are what 
> we're truly looking for here. To understand what I mean, I recommend playing 
> with this grapheme cluster:
> 
> नमस्ते
> 
> According to the Rust book [0], it's composed of 6 code points: ['न', 'म', 
> 'स', '्', 'त', 'े'], but moving the cursor around, I would be led to believe 
> it's 4 "pieces" long only.
> 
> Cheers,
> Dorota
> 
> [0] https://doc.rust-lang.org/book/second-edition/ch08-02-strings.html

On second thought, perhaps graphemes are actually the relevant thing here...




Re: [PATCHv4 wayland-protocols] text-input: Add v3 of the text-input protocol

2018-05-10 Thread Dorota Czaplejewicz
On Tue, 08 May 2018 07:07:24 +
Silvan Jegen  wrote:

> On Mon, May 7, 2018 at 5:11 AM Joshua Watt  wrote:
> > IMHO, if you are doing UTF-8 (which you should), you should *always*
> > specify any offset in the string as a byte offset. I have a few
> > reasons for this justification:  
> 
> I agree with this as well. I thought some more about how to spell out my
> gut feeling on this matter in more technical terms.
> 
> UTF-8 is a byte (sequence) representation of Unicode code points. This
> indicates to me that an offset within an UTF-8-encoded string should also
> > be given in bytes. Specifying the offset in Unicode code points mixes the
> abstraction of the Unicode code point with (one of) its representations as
> a byte sequence. This is reflected in the fact that an offset in Unicode
> code points is not applicable to the UTF-8 string without first processing
> the string.
> 
> Unicode code points do not give us that much either since what we most
> likely want are grapheme clusters anyway (which, like any more advanced
> Unicode processing, should be handled by a specialised library):
> http://utf8everywhere.org/#myth.strlen
> 
> 
> Cheers,
> 
> Silvan

This message made me feel obliged to turn my own gut feeling into words. This 
is not to be construed as an argument, but more of an explanation.

I view wayland protocols as rather high level: their responsibility is to 
specify the type and the purpose of the data they are transporting. In this 
case, the data is a Unicode string, and the purpose is display. Or, the data is 
a number and the purpose is indexing.

I think that when a protocol starts to specify the type and purpose, it can no 
longer be thought of as high level. In this view, indexing a Unicode string in 
terms of bytes would be akin to indexing any other vector of Foo in bytes. (I 
didn't actually check if there is any other vector, or bytes type available in 
wayland).

As you noted, there is some mixing between abstraction levels in the protocol. 
Hardcoding that it's not *just* Unicode, but also the particular encoding 
(UTF-8) eliminates problems with byte indexing we would have encountered if we 
decided to use things like Punycode (München => Mnchen-3ya). Knowing that it's 
always UTF-8 allows the protocol to use a tailored indexing scheme. While I 
consider this a layer-breaking hack, nevertheless, this property partially 
counters the above reasoning.

* * *

To be honest, neither Unicode code points nor graphemes nor clusters are what 
we're truly looking for here. To understand what I mean, I recommend playing 
with this grapheme cluster:

नमस्ते

According to the Rust book [0], it's composed of 6 code points: ['न', 'म', 'स', 
'्', 'त', 'े'], but moving the cursor around, I would be led to believe it's 4 
"pieces" long only.
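
The counts can be checked with a short Python sketch using only the
standard library. Note that the last number is a crude approximation
(code points that are not combining marks), not the grapheme cluster
segmentation from UAX #29, which needs a dedicated library and may give
a different result depending on the Unicode version:

```python
import unicodedata

# The six code points of the word, written as escapes to avoid ambiguity.
word = "\u0928\u092e\u0938\u094d\u0924\u0947"

code_points = len(word)
utf8_bytes = len(word.encode("utf-8"))
# Approximation only: count code points whose general category is not a
# mark (Mn, Mc, Me). The virama and the vowel sign are both marks.
non_marks = sum(1 for ch in word if not unicodedata.category(ch).startswith("M"))

print(code_points, utf8_bytes, non_marks)  # 6 18 4
```

So the same word measures 18 in bytes, 6 in code points, and roughly 4 in
terms of where a cursor might stop.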

Cheers,
Dorota

[0] https://doc.rust-lang.org/book/second-edition/ch08-02-strings.html




Re: [PATCHv4 wayland-protocols] text-input: Add v3 of the text-input protocol

2018-05-08 Thread Silvan Jegen
On Mon, May 7, 2018 at 5:11 AM Joshua Watt  wrote:
> IMHO, if you are doing UTF-8 (which you should), you should *always*
> specify any offset in the string as a byte offset. I have a few
> reasons for this justification:

I agree with this as well. I thought some more about how to spell out my
gut feeling on this matter in more technical terms.

UTF-8 is a byte (sequence) representation of Unicode code points. This
indicates to me that an offset within an UTF-8-encoded string should also
be given in bytes. Specifying the offset in Unicode code points mixes the
abstraction of the Unicode code point with (one of) its representations as
a byte sequence. This is reflected in the fact that an offset in Unicode
code points is not applicable to the UTF-8 string without first processing
the string.

Unicode code points do not give us that much either since what we most
likely want are grapheme clusters anyway (which, like any more advanced
Unicode processing, should be handled by a specialised library):
http://utf8everywhere.org/#myth.strlen
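
To illustrate the "without first processing the string" point, here is a
minimal sketch of what a receiver has to do to turn a code point offset
into a byte offset. The function name code_point_to_byte_offset is
hypothetical and the input is assumed to be valid UTF-8. The scan is
linear in the length of the string, whereas a byte offset could be used
directly:

```python
def code_point_to_byte_offset(data: bytes, index: int) -> int:
    """Return the byte offset at which code point number `index` starts.

    `index` may equal the number of code points, meaning "end of string".
    """
    seen = 0
    for pos, byte in enumerate(data):
        # Every byte that is not a continuation byte (10xxxxxx) starts
        # a new code point.
        if (byte & 0xC0) != 0x80:
            if seen == index:
                return pos
            seen += 1
    if seen == index:
        return len(data)
    raise IndexError("code point index out of range")


encoded = "æþ|".encode("utf-8")
print([code_point_to_byte_offset(encoded, i) for i in range(4)])  # [0, 2, 4, 5]
```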


Cheers,

Silvan


Re: [PATCHv4 wayland-protocols] text-input: Add v3 of the text-input protocol

2018-05-07 Thread Silvan Jegen
On Sun, May 06, 2018 at 10:37:57PM +0200, Dorota Czaplejewicz wrote:
> On Sat, 5 May 2018 13:37:44 +0200
> Silvan Jegen  wrote:
> 
> > On Sat, May 05, 2018 at 11:09:10AM +0200, Dorota Czaplejewicz wrote:
> > > On Fri, 4 May 2018 22:32:15 +0200
> > > Silvan Jegen  wrote:
> > >   
> > > > On Thu, May 03, 2018 at 10:46:47PM +0200, Dorota Czaplejewicz wrote:  
> > > > > On Thu, 3 May 2018 21:55:40 +0200
> > > > > Silvan Jegen  wrote:
> > >
> > > [...]
> > >
> > > In the end, I'm not an expert in that area either - perhaps treating
> > > client side strings as UTF-8 buffers makes sense, but at the moment
> > > I'm still leaning towards the code point abstraction.  
> > 
> > Someone (™) should probably implement a client making use of the protocol
> > to see what the real world impact of this protocol change would be.
> > 
> > The editor in the weston project uses pango for its text layout:
> > 
> > https://cgit.freedesktop.org/wayland/weston/tree/clients/editor.c#n824
> > 
> > so it would have to parse the UTF-8 string twice. The same is most likely
> > true for all programs using GTK...
> > 
> > 
> 
> I made an attempt to dig deeper, and while I stopped short of becoming
> this Someone for now, I gathered what I think are some important
> results.
> 
> First, the state of the libraries. There's a lot of data I gathered,
> so I'll keep this section rather dense. First, another contender
> for the title of text layout library, and that one uses code points
> exclusively:
> 
> https://github.com/silnrsi/graphite/blob/master/include/graphite2/Segment.h 
> `gr_make_seg`
> 
> https://github.com/silnrsi/graphite/blob/master/tests/examples/simple.c
> 
> Afterwards, I focused on GTK and Qt. As an input method plugin
> developer, I looked at the IM interfaces and internal data structures
> they expose. The results were not that clear - no mention of "code
> points", some references to "bytes", many to "characters" (not
> "chars"). What is certain is that there's a lot of converting going on

Yes, it's very unfortunate that a lot of developers do not strive for
more clarity and precision in terminology when processing text.


> behind the scenes anyway. First off, GTK seems to be moving away from
> bytes, judging by the comments:
> 
> gtk 3.22 (`gtkimcontext.c`)
> 
> `gtk_im_context_delete_surrounding`
> 
> > * Asks the widget that the input context is attached to to delete
> > * characters around the cursor position by emitting the
> > * GtkIMContext::delete_surrounding signal. Note that @offset and @n_chars
> > * are in characters not in bytes which differs from the usage other
> > * places in #GtkIMContext.
> 
> `gtk_im_context_get_preedit_string`
> 
> > * @cursor_pos: (out): location to store position of cursor (in characters)
> > *  within the preedit string.  
> 
> `gtk_im_context_get_surrounding`
> 
> > * @cursor_index: (out): location to store byte index of the insertion
> > *cursor within @text.
> 
> gtkEntry seems to store things internally as characters.

They mention "characters" but what they most likely mean are Unicode
code points.

One would think they would try to keep their APIs consistent but that
doesn't seem to be the case.


> While GTK using code points internally is not a proof of anything,
> it's a suggestion that there is a reason not to use bytes.
> 
> Then, Qt, from https://doc.qt.io/qt-5/qinputmethodevent.html#setCommitString
> 
> > replaceLength specifies the number of characters to be replaced
> 
> a confirmation that "characters" means "code points" comes from
> https://doc.qt.io/qt-5/qlineedit.html#cursorPosition-prop . The value
> reported when "æþ|" is displayed is 2.

https://doc.qt.io/qt-5/qstring.html

Qt uses UTF-16 internally so they *could* also be counting "QChars"
which are 16-bit (assuming the position is 0 indexed):

Python 3.6.5 (default, Apr 14 2018, 13:17:30)
[GCC 7.3.1 20180406] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> "æþ"
'æþ'
>>> "æþ".encode("utf-16")
b'\xff\xfe\xe6\x00\xfe\x00'

If they are really doing that you would only notice it with characters
outside of the BMP because:

"(Unicode characters with code values above 65535 are stored using
surrogate pairs, i.e., two consecutive QChars.)"
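
A small sketch of the difference, in plain Python rather than Qt: the
helper utf16_units below is made up for illustration and counts 16-bit
code units, which is the unit a QChar-based position would use if that is
indeed what Qt reports. It encodes as "utf-16-le" so that no byte order
mark is included, unlike the "utf-16" output shown above.

```python
def utf16_units(text: str) -> int:
    """Number of 16-bit code units needed to store `text` as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


bmp = "æþ"             # both characters are inside the BMP
astral = "\U0001F600"  # one code point outside the BMP (an emoji)

print(len(bmp), utf16_units(bmp))        # 2 2
print(len(astral), utf16_units(astral))  # 1 2
```

For "æþ" both counts agree, so that test string cannot tell code points
and UTF-16 code units apart; a character outside the BMP can.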

I think everybody agrees that (Unicode) text handling is a mess in
general...


> I also spent more time than I should writing a demo implementation
> of an input method and a client connecting to it to check out the
> proposed interfaces. Predictably, it gave me a lot of trouble
> on the edges between bytes and code points, but I blame it on
> Rust's scarcity of UTF handling functions. The hack is available at
> https://code.puri.sm/dorota.czaplejewicz/impoc

Thanks for taking the time! I compiled and ran it but my Rust is weak...

Rust has an interesting String type:

https://doc.rust-lang.org/std/string/struct.String.html#utf-8

It's UTF-8 encoded but you are 

Re: [PATCHv4 wayland-protocols] text-input: Add v3 of the text-input protocol

2018-05-07 Thread Dorota Czaplejewicz
On Sun, 6 May 2018 22:11:32 -0500
Joshua Watt  wrote:

> On Sun, May 6, 2018 at 3:37 PM, Dorota Czaplejewicz
>  wrote:
> > On Sat, 5 May 2018 13:37:44 +0200
> > Silvan Jegen  wrote:
> >  
> >> On Sat, May 05, 2018 at 11:09:10AM +0200, Dorota Czaplejewicz wrote:  
> >> > On Fri, 4 May 2018 22:32:15 +0200
> >> > Silvan Jegen  wrote:
> >> >  
> >> > > On Thu, May 03, 2018 at 10:46:47PM +0200, Dorota Czaplejewicz wrote:  
> >> > > > On Thu, 3 May 2018 21:55:40 +0200
> >> > > > Silvan Jegen  wrote:
> >> > > >  
> >> > > > > On Thu, May 03, 2018 at 09:22:46PM +0200, Dorota Czaplejewicz 
> >> > > > > wrote:  
> >> > > > > > On Thu, 3 May 2018 20:47:27 +0200
> >> > > > > > Silvan Jegen  wrote:
> >> > > > > >  
> >> > > > > > > Hi Dorota
> >> > > > > > >
> >> > > > > > > Some comments and typo fixes below.
> >> > > > > > >
> >> > > > > > > On Thu, May 03, 2018 at 05:41:21PM +0200, Dorota Czaplejewicz 
> >> > > > > > > wrote:  
> >> > > > > > > > +  Text is valid UTF-8 encoded, indices and lengths are 
> >> > > > > > > > in code points. If a
> >> > > > > > > > +  grapheme is made up of multiple code points, an index 
> >> > > > > > > > pointing to any of
> >> > > > > > > > +  them should be interpreted as pointing to the first 
> >> > > > > > > > one.  
> >> > > > > > >
> >> > > > > > > That way we make sure we don't put the cursor/anchor between 
> >> > > > > > > bytes that
> >> > > > > > > belong to the same UTF-8 encoded Unicode code point which is 
> >> > > > > > > nice. It
> >> > > > > > > also means that the client has to parse all the UTF-8 encoded 
> >> > > > > > > strings
> >> > > > > > > into Unicode code points up to the desired cursor/anchor 
> >> > > > > > > position
> >> > > > > > > on each "preedit_string" event. For each 
> >> > > > > > > "delete_surrounding_text" event
> >> > > > > > > the client has to parse the UTF-8 sequences before and after 
> >> > > > > > > the cursor
> >> > > > > > > position up to the requested Unicode code point.
> >> > > > > > >
> >> > > > > > > I feel like we are processing the UTF-8 string already in the
> >> > > > > > > input-method. So I am not sure that we should parse it again 
> >> > > > > > > on the
> >> > > > > > > client side. Parsing it again would also mean that the client 
> >> > > > > > > would need
> >> > > > > > > to know about UTF-8 which would be nice to avoid.
> >> > > > > > >
> >> > > > > > > Thoughts?  
> >> > > > > >
> >> > > > > > The client needs to know about Unicode, but not necessarily about
> >> > > > > > UTF-8. Specifying code points is actually an advantage here, 
> >> > > > > > because
> >> > > > > > byte offsets are inherently expressed relative to UTF-8. By 
> >> > > > > > counting
> >> > > > > > with code points, client's internal representation can be UTF-16 
> >> > > > > > or
> >> > > > > > maybe even something else.  
> >> > > > >
> >> > > > > Maybe I am misunderstanding something but the protocol specifies 
> >> > > > > that
> >> > > > > the strings are valid UTF-8 encoded and the cursor/anchor offsets 
> >> > > > > into
> >> > > > > the strings are specified in Unicode points. To me that indicates 
> >> > > > > that
> >> > > > > the application *has to parse* the UTF-8 string into Unicode points
> >> > > > > when receiving the event otherwise it doesn't know after which 
> >> > > > > Unicode
> >> > > > > point to draw the cursor. Of course the application can then 
> >> > > > > decide to
> >> > > > > convert the UTF-8 string into another encoding like UTF-16 for 
> >> > > > > internal
> >> > > > > processing (for whatever reason) but that doesn't change the fact 
> >> > > > > that
> >> > > > > it still would have to parse the incoming UTF-8 (and thus know 
> >> > > > > about
> >> > > > > UTF-8).
> >> > > > >  
> >> > > > Can you see any way to avoid parsing UTF-8 in order to draw the
> >> > > > cursor? I tried to come up with a way to do that, but even with
> >> > > > specifying byte strings, I believe that calculating the position of
> >> > > > the cursor - either in pixels or in glyphs - requires full parsing of
> >> > > > the input string.  
> >> > >
> >> > > Yes, I don't think it's avoidable either. You just don't have to do
> >> > > it twice if your text rendering library consumes UTF-8 strings with
> >> > > byte-offsets though. See my response below.
> >> > >
> >> > >  
> >> > > > > > There's no avoiding the parsing either. What the application 
> >> > > > > > cares
> >> > > > > > about is that the cursor falls between glyphs. The application 
> >> > > > > > cannot
> >> > > > > > know that in all cases. Unicode allows the same sequence to be
> >> > > > > > displayed in multiple ways (fallback):
> >> > > > > >
> >> > > > > > https://unicode.org/emoji/charts/emoji-zwj-sequences.html
> >> > > > > >
> >> > > > > > One could make an argument that byte offsets should never be 
> >> > > > > > close

Re: [PATCHv4 wayland-protocols] text-input: Add v3 of the text-input protocol

2018-05-06 Thread Joshua Watt
On Sun, May 6, 2018 at 3:37 PM, Dorota Czaplejewicz
 wrote:
> On Sat, 5 May 2018 13:37:44 +0200
> Silvan Jegen  wrote:
>
>> On Sat, May 05, 2018 at 11:09:10AM +0200, Dorota Czaplejewicz wrote:
>> > On Fri, 4 May 2018 22:32:15 +0200
>> > Silvan Jegen  wrote:
>> >
>> > > On Thu, May 03, 2018 at 10:46:47PM +0200, Dorota Czaplejewicz wrote:
>> > > > On Thu, 3 May 2018 21:55:40 +0200
>> > > > Silvan Jegen  wrote:
>> > > >
>> > > > > On Thu, May 03, 2018 at 09:22:46PM +0200, Dorota Czaplejewicz wrote:
>> > > > > > On Thu, 3 May 2018 20:47:27 +0200
>> > > > > > Silvan Jegen  wrote:
>> > > > > >
>> > > > > > > Hi Dorota
>> > > > > > >
>> > > > > > > Some comments and typo fixes below.
>> > > > > > >
>> > > > > > > On Thu, May 03, 2018 at 05:41:21PM +0200, Dorota Czaplejewicz 
>> > > > > > > wrote:
>> > > > > > > > +  Text is valid UTF-8 encoded, indices and lengths are in 
>> > > > > > > > code points. If a
>> > > > > > > > +  grapheme is made up of multiple code points, an index 
>> > > > > > > > pointing to any of
>> > > > > > > > +  them should be interpreted as pointing to the first one.
>> > > > > > >
>> > > > > > > That way we make sure we don't put the cursor/anchor between 
>> > > > > > > bytes that
>> > > > > > > belong to the same UTF-8 encoded Unicode code point which is 
>> > > > > > > nice. It
>> > > > > > > also means that the client has to parse all the UTF-8 encoded 
>> > > > > > > strings
>> > > > > > > into Unicode code points up to the desired cursor/anchor position
>> > > > > > > on each "preedit_string" event. For each 
>> > > > > > > "delete_surrounding_text" event
>> > > > > > > the client has to parse the UTF-8 sequences before and after the 
>> > > > > > > cursor
>> > > > > > > position up to the requested Unicode code point.
>> > > > > > >
>> > > > > > > I feel like we are processing the UTF-8 string already in the
>> > > > > > > input-method. So I am not sure that we should parse it again on 
>> > > > > > > the
>> > > > > > > client side. Parsing it again would also mean that the client 
>> > > > > > > would need
>> > > > > > > to know about UTF-8 which would be nice to avoid.
>> > > > > > >
>> > > > > > > Thoughts?
>> > > > > >
>> > > > > > The client needs to know about Unicode, but not necessarily about
>> > > > > > UTF-8. Specifying code points is actually an advantage here, 
>> > > > > > because
>> > > > > > byte offsets are inherently expressed relative to UTF-8. By 
>> > > > > > counting
>> > > > > > with code points, client's internal representation can be UTF-16 or
>> > > > > > maybe even something else.
>> > > > >
>> > > > > Maybe I am misunderstanding something but the protocol specifies that
>> > > > > the strings are valid UTF-8 encoded and the cursor/anchor offsets 
>> > > > > into
>> > > > > the strings are specified in Unicode points. To me that indicates 
>> > > > > that
>> > > > > the application *has to parse* the UTF-8 string into Unicode points
>> > > > > when receiving the event otherwise it doesn't know after which 
>> > > > > Unicode
>> > > > > point to draw the cursor. Of course the application can then decide 
>> > > > > to
>> > > > > convert the UTF-8 string into another encoding like UTF-16 for 
>> > > > > internal
>> > > > > processing (for whatever reason) but that doesn't change the fact 
>> > > > > that
>> > > > > it still would have to parse the incoming UTF-8 (and thus know about
>> > > > > UTF-8).
>> > > > >
>> > > > Can you see any way to avoid parsing UTF-8 in order to draw the
>> > > > cursor? I tried to come up with a way to do that, but even with
>> > > > specifying byte strings, I believe that calculating the position of
>> > > > the cursor - either in pixels or in glyphs - requires full parsing of
>> > > > the input string.
>> > >
>> > > Yes, I don't think it's avoidable either. You just don't have to do
>> > > it twice if your text rendering library consumes UTF-8 strings with
>> > > byte-offsets though. See my response below.
>> > >
>> > >
>> > > > > > There's no avoiding the parsing either. What the application cares
>> > > > > > about is that the cursor falls between glyphs. The application 
>> > > > > > cannot
>> > > > > > know that in all cases. Unicode allows the same sequence to be
>> > > > > > displayed in multiple ways (fallback):
>> > > > > >
>> > > > > > https://unicode.org/emoji/charts/emoji-zwj-sequences.html
>> > > > > >
>> > > > > > One could make an argument that byte offsets should never be close
>> > > > > > to ZWJ characters, but I think this decision is better left to the
>> > > > > > application, which knows what exactly it is presenting to the user.
>> > > > >
>> > > > > The idea of the previous version of the protocol (from my 
>> > > > > understanding)
>> > > > > was to make sure that only valid UTF-8 and valid byte-offsets (== not
>> > > > > falling between bytes of a Unicode code point) into the string 

Re: [PATCHv4 wayland-protocols] text-input: Add v3 of the text-input protocol

2018-05-06 Thread Dorota Czaplejewicz
On Sat, 5 May 2018 13:37:44 +0200
Silvan Jegen  wrote:

> On Sat, May 05, 2018 at 11:09:10AM +0200, Dorota Czaplejewicz wrote:
> > On Fri, 4 May 2018 22:32:15 +0200
> > Silvan Jegen  wrote:
> >   
> > > On Thu, May 03, 2018 at 10:46:47PM +0200, Dorota Czaplejewicz wrote:  
> > > > On Thu, 3 May 2018 21:55:40 +0200
> > > > Silvan Jegen  wrote:
> > > > 
> > > > > On Thu, May 03, 2018 at 09:22:46PM +0200, Dorota Czaplejewicz wrote:  
> > > > >   
> > > > > > On Thu, 3 May 2018 20:47:27 +0200
> > > > > > Silvan Jegen  wrote:
> > > > > >   
> > > > > > > Hi Dorota
> > > > > > > 
> > > > > > > Some comments and typo fixes below.
> > > > > > > 
> > > > > > > On Thu, May 03, 2018 at 05:41:21PM +0200, Dorota Czaplejewicz 
> > > > > > > wrote:  
> > > > > > > > +  Text is valid UTF-8 encoded, indices and lengths are in 
> > > > > > > > code points. If a
> > > > > > > > +  grapheme is made up of multiple code points, an index 
> > > > > > > > pointing to any of
> > > > > > > > +  them should be interpreted as pointing to the first one. 
> > > > > > > >
> > > > > > > 
> > > > > > > That way we make sure we don't put the cursor/anchor between 
> > > > > > > bytes that
> > > > > > > belong to the same UTF-8 encoded Unicode code point which is 
> > > > > > > nice. It
> > > > > > > also means that the client has to parse all the UTF-8 encoded 
> > > > > > > strings
> > > > > > > into Unicode code points up to the desired cursor/anchor position
> > > > > > > on each "preedit_string" event. For each 
> > > > > > > "delete_surrounding_text" event
> > > > > > > the client has to parse the UTF-8 sequences before and after the 
> > > > > > > cursor
> > > > > > > position up to the requested Unicode code point.
> > > > > > > 
> > > > > > > I feel like we are processing the UTF-8 string already in the
> > > > > > > input-method. So I am not sure that we should parse it again on 
> > > > > > > the
> > > > > > > client side. Parsing it again would also mean that the client 
> > > > > > > would need
> > > > > > > to know about UTF-8 which would be nice to avoid.
> > > > > > > 
> > > > > > > Thoughts?  
> > > > > > 
> > > > > > The client needs to know about Unicode, but not necessarily about
> > > > > > UTF-8. Specifying code points is actually an advantage here, because
> > > > > > byte offsets are inherently expressed relative to UTF-8. By counting
> > > > > > with code points, client's internal representation can be UTF-16 or
> > > > > > maybe even something else.  
> > > > > 
> > > > > Maybe I am misunderstanding something but the protocol specifies that
> > > > > the strings are valid UTF-8 encoded and the cursor/anchor offsets into
> > > > > the strings are specified in Unicode points. To me that indicates that
> > > > > the application *has to parse* the UTF-8 string into Unicode points
> > > > > when receiving the event otherwise it doesn't know after which Unicode
> > > > > point to draw the cursor. Of course the application can then decide to
> > > > > convert the UTF-8 string into another encoding like UTF-16 for 
> > > > > internal
> > > > > processing (for whatever reason) but that doesn't change the fact that
> > > > > it still would have to parse the incoming UTF-8 (and thus know about
> > > > > UTF-8).
> > > > > 
> > > > Can you see any way to avoid parsing UTF-8 in order to draw the
> > > > cursor? I tried to come up with a way to do that, but even with
> > > > specifying byte strings, I believe that calculating the position of
> > > > the cursor - either in pixels or in glyphs - requires full parsing of
> > > > the input string.
> > > 
> > > Yes, I don't think it's avoidable either. You just don't have to do
> > > it twice if your text rendering library consumes UTF-8 strings with
> > > byte-offsets though. See my response below.
> > > 
> > >   
> > > > > > There's no avoiding the parsing either. What the application cares
> > > > > > about is that the cursor falls between glyphs. The application 
> > > > > > cannot
> > > > > > know that in all cases. Unicode allows the same sequence to be
> > > > > > displayed in multiple ways (fallback):
> > > > > > 
> > > > > > https://unicode.org/emoji/charts/emoji-zwj-sequences.html
> > > > > > 
> > > > > > One could make an argument that byte offsets should never be close
> > > > > > to ZWJ characters, but I think this decision is better left to the
> > > > > > application, which knows what exactly it is presenting to the user. 
> > > > > >  
> > > > > 
> > > > > The idea of the previous version of the protocol (from my 
> > > > > understanding)
> > > > > was to make sure that only valid UTF-8 and valid byte-offsets (== not
> > > > > falling between bytes of a Unicode code point) into the string will be
> > > > > sent to the client. If you just get a byte-offset into a UTF-8 encoded
> > > > > string you trust the sender to honor the protocol and 

Re: [PATCHv4 wayland-protocols] text-input: Add v3 of the text-input protocol

2018-05-05 Thread Silvan Jegen
On Sat, May 05, 2018 at 11:09:10AM +0200, Dorota Czaplejewicz wrote:
> On Fri, 4 May 2018 22:32:15 +0200
> Silvan Jegen  wrote:
> 
> > On Thu, May 03, 2018 at 10:46:47PM +0200, Dorota Czaplejewicz wrote:
> > > On Thu, 3 May 2018 21:55:40 +0200
> > > Silvan Jegen  wrote:
> > >   
> > > > On Thu, May 03, 2018 at 09:22:46PM +0200, Dorota Czaplejewicz wrote:  
> > > > > On Thu, 3 May 2018 20:47:27 +0200
> > > > > Silvan Jegen  wrote:
> > > > > 
> > > > > > Hi Dorota
> > > > > > 
> > > > > > Some comments and typo fixes below.
> > > > > > 
> > > > > > On Thu, May 03, 2018 at 05:41:21PM +0200, Dorota Czaplejewicz 
> > > > > > wrote:
> > > > > > > +  Text is valid UTF-8 encoded, indices and lengths are in 
> > > > > > > code points. If a
> > > > > > > +  grapheme is made up of multiple code points, an index 
> > > > > > > pointing to any of
> > > > > > > +  them should be interpreted as pointing to the first one.   
> > > > > > >
> > > > > > 
> > > > > > That way we make sure we don't put the cursor/anchor between bytes 
> > > > > > that
> > > > > > belong to the same UTF-8 encoded Unicode code point which is nice. 
> > > > > > It
> > > > > > also means that the client has to parse all the UTF-8 encoded 
> > > > > > strings
> > > > > > into Unicode code points up to the desired cursor/anchor position
> > > > > > on each "preedit_string" event. For each "delete_surrounding_text" 
> > > > > > event
> > > > > > the client has to parse the UTF-8 sequences before and after the 
> > > > > > cursor
> > > > > > position up to the requested Unicode code point.
> > > > > > 
> > > > > > I feel like we are processing the UTF-8 string already in the
> > > > > > input-method. So I am not sure that we should parse it again on the
> > > > > > client side. Parsing it again would also mean that the client would 
> > > > > > need
> > > > > > to know about UTF-8 which would be nice to avoid.
> > > > > > 
> > > > > > Thoughts?
> > > > > 
> > > > > The client needs to know about Unicode, but not necessarily about
> > > > > UTF-8. Specifying code points is actually an advantage here, because
> > > > > byte offsets are inherently expressed relative to UTF-8. By counting
> > > > > with code points, client's internal representation can be UTF-16 or
> > > > > maybe even something else.
> > > > 
> > > > Maybe I am misunderstanding something but the protocol specifies that
> > > > the strings are valid UTF-8 encoded and the cursor/anchor offsets into
> > > > the strings are specified in Unicode points. To me that indicates that
> > > > the application *has to parse* the UTF-8 string into Unicode points
> > > > when receiving the event otherwise it doesn't know after which Unicode
> > > > point to draw the cursor. Of course the application can then decide to
> > > > convert the UTF-8 string into another encoding like UTF-16 for internal
> > > > processing (for whatever reason) but that doesn't change the fact that
> > > > it still would have to parse the incoming UTF-8 (and thus know about
> > > > UTF-8).
> > > >   
> > > Can you see any way to avoid parsing UTF-8 in order to draw the
> > > cursor? I tried to come up with a way to do that, but even with
> > > specifying byte strings, I believe that calculating the position of
> > > the cursor - either in pixels or in glyphs - requires full parsing of
> > > the input string.  
> > 
> > Yes, I don't think it's avoidable either. You just don't have to do
> > it twice if your text rendering library consumes UTF-8 strings with
> > byte-offsets though. See my response below.
> > 
> > 
> > > > > There's no avoiding the parsing either. What the application cares
> > > > > about is that the cursor falls between glyphs. The application cannot
> > > > > know that in all cases. Unicode allows the same sequence to be
> > > > > displayed in multiple ways (fallback):
> > > > > 
> > > > > https://unicode.org/emoji/charts/emoji-zwj-sequences.html
> > > > > 
> > > > > One could make an argument that byte offsets should never be close
> > > > > to ZWJ characters, but I think this decision is better left to the
> > > > > application, which knows what exactly it is presenting to the user.   
> > > > >  
> > > > 
> > > > The idea of the previous version of the protocol (from my understanding)
> > > > was to make sure that only valid UTF-8 and valid byte-offsets (== not
> > > > falling between bytes of a Unicode code point) into the string will be
> > > > sent to the client. If you just get a byte-offset into a UTF-8 encoded
> > > > string you trust the sender to honor the protocol and thus you can just
> > > > pass the UTF-8 encoded string unprocessed to your text rendering library
> > > > (provided that the library supports UTF-8 strings which is what I am
> > > > assuming) without having to parse the UTF-8 string into Unicode code
> > > > points.
> > > > 
> > > > Of course the Unicode code points will have to be parsed at 

Re: [PATCHv4 wayland-protocols] text-input: Add v3 of the text-input protocol

2018-05-05 Thread Dorota Czaplejewicz
On Fri, 4 May 2018 22:32:15 +0200
Silvan Jegen  wrote:

> On Thu, May 03, 2018 at 10:46:47PM +0200, Dorota Czaplejewicz wrote:
> > On Thu, 3 May 2018 21:55:40 +0200
> > Silvan Jegen  wrote:
> >   
> > > On Thu, May 03, 2018 at 09:22:46PM +0200, Dorota Czaplejewicz wrote:  
> > > > On Thu, 3 May 2018 20:47:27 +0200
> > > > Silvan Jegen  wrote:
> > > > 
> > > > > Hi Dorota
> > > > > 
> > > > > Some comments and typo fixes below.
> > > > > 
> > > > > On Thu, May 03, 2018 at 05:41:21PM +0200, Dorota Czaplejewicz wrote:  
> > > > >   
> > > > > > +  Text is valid UTF-8 encoded, indices and lengths are in code 
> > > > > > points. If a
> > > > > > +  grapheme is made up of multiple code points, an index 
> > > > > > pointing to any of
> > > > > > +  them should be interpreted as pointing to the first one. 
> > > > > >  
> > > > > 
> > > > > That way we make sure we don't put the cursor/anchor between bytes 
> > > > > that
> > > > > belong to the same UTF-8 encoded Unicode code point which is nice. It
> > > > > also means that the client has to parse all the UTF-8 encoded strings
> > > > > into Unicode code points up to the desired cursor/anchor position
> > > > > on each "preedit_string" event. For each "delete_surrounding_text" 
> > > > > event
> > > > > the client has to parse the UTF-8 sequences before and after the 
> > > > > cursor
> > > > > position up to the requested Unicode code point.
> > > > > 
> > > > > I feel like we are processing the UTF-8 string already in the
> > > > > input-method. So I am not sure that we should parse it again on the
> > > > > client side. Parsing it again would also mean that the client would 
> > > > > need
> > > > > to know about UTF-8 which would be nice to avoid.
> > > > > 
> > > > > Thoughts?
> > > > 
> > > > The client needs to know about Unicode, but not necessarily about
> > > > UTF-8. Specifying code points is actually an advantage here, because
> > > > byte offsets are inherently expressed relative to UTF-8. By counting
> > > > with code points, client's internal representation can be UTF-16 or
> > > > maybe even something else.
> > > 
> > > Maybe I am misunderstanding something but the protocol specifies that
> > > the strings are valid UTF-8 encoded and the cursor/anchor offsets into
> > > the strings are specified in Unicode points. To me that indicates that
> > > the application *has to parse* the UTF-8 string into Unicode points
> > > when receiving the event otherwise it doesn't know after which Unicode
> > > point to draw the cursor. Of course the application can then decide to
> > > convert the UTF-8 string into another encoding like UTF-16 for internal
> > > processing (for whatever reason) but that doesn't change the fact that
> > > it still would have to parse the incoming UTF-8 (and thus know about
> > > UTF-8).
> > >   
> > Can you see any way to avoid parsing UTF-8 in order to draw the
> > cursor? I tried to come up with a way to do that, but even with
> > specifying byte strings, I believe that calculating the position of
> > the cursor - either in pixels or in glyphs - requires full parsing of
> > the input string.  
> 
> Yes, I don't think it's avoidable either. You just don't have to do
> it twice if your text rendering library consumes UTF-8 strings with
> byte-offsets though. See my response below.
> 
> 
> > > > There's no avoiding the parsing either. What the application cares
> > > > about is that the cursor falls between glyphs. The application cannot
> > > > know that in all cases. Unicode allows the same sequence to be
> > > > displayed in multiple ways (fallback):
> > > > 
> > > > https://unicode.org/emoji/charts/emoji-zwj-sequences.html
> > > > 
> > > > One could make an argument that byte offsets should never be close
> > > > to ZWJ characters, but I think this decision is better left to the
> > > > application, which knows what exactly it is presenting to the user.
> > > 
> > > The idea of the previous version of the protocol (from my understanding)
> > > was to make sure that only valid UTF-8 and valid byte-offsets (== not
> > > falling between bytes of a Unicode code point) into the string will be
> > > sent to the client. If you just get a byte-offset into a UTF-8 encoded
> > > string you trust the sender to honor the protocol and thus you can just
> > > pass the UTF-8 encoded string unprocessed to your text rendering library
> > > (provided that the library supports UTF-8 strings which is what I am
> > > assuming) without having to parse the UTF-8 string into Unicode code
> > > points.
> > > 
> > > Of course the Unicode code points will have to be parsed at some point
> > > if you want to render them. Using byte-offsets just lets you do that at
> > > a later stage if your libraries support UTF-8.
> > > 
> > >   
> > Doesn't that chiefly depend on what kind of the text rendering library
> > though? As far as I understand, passing text to 

Re: [PATCHv4 wayland-protocols] text-input: Add v3 of the text-input protocol

2018-05-04 Thread Silvan Jegen
On Thu, May 03, 2018 at 10:46:47PM +0200, Dorota Czaplejewicz wrote:
> On Thu, 3 May 2018 21:55:40 +0200
> Silvan Jegen  wrote:
> 
> > On Thu, May 03, 2018 at 09:22:46PM +0200, Dorota Czaplejewicz wrote:
> > > On Thu, 3 May 2018 20:47:27 +0200
> > > Silvan Jegen  wrote:
> > >   
> > > > Hi Dorota
> > > > 
> > > > Some comments and typo fixes below.
> > > > 
> > > > On Thu, May 03, 2018 at 05:41:21PM +0200, Dorota Czaplejewicz wrote:  
> > > > > +  Text is valid UTF-8 encoded, indices and lengths are in code 
> > > > > points. If a
> > > > > +  grapheme is made up of multiple code points, an index pointing 
> > > > > to any of
> > > > > +  them should be interpreted as pointing to the first one.
> > > > 
> > > > That way we make sure we don't put the cursor/anchor between bytes that
> > > > belong to the same UTF-8 encoded Unicode code point which is nice. It
> > > > also means that the client has to parse all the UTF-8 encoded strings
> > > > into Unicode code points up to the desired cursor/anchor position
> > > > on each "preedit_string" event. For each "delete_surrounding_text" event
> > > > the client has to parse the UTF-8 sequences before and after the cursor
> > > > position up to the requested Unicode code point.
> > > > 
> > > > I feel like we are processing the UTF-8 string already in the
> > > > input-method. So I am not sure that we should parse it again on the
> > > > client side. Parsing it again would also mean that the client would need
> > > > to know about UTF-8 which would be nice to avoid.
> > > > 
> > > > Thoughts?  
> > > 
> > > The client needs to know about Unicode, but not necessarily about
> > > UTF-8. Specifying code points is actually an advantage here, because
> > > byte offsets are inherently expressed relative to UTF-8. By counting
> > > with code points, client's internal representation can be UTF-16 or
> > > maybe even something else.  
> > 
> > Maybe I am misunderstanding something but the protocol specifies that
> > the strings are valid UTF-8 encoded and the cursor/anchor offsets into
> > the strings are specified in Unicode points. To me that indicates that
> > the application *has to parse* the UTF-8 string into Unicode points
> > when receiving the event otherwise it doesn't know after which Unicode
> > point to draw the cursor. Of course the application can then decide to
> > convert the UTF-8 string into another encoding like UTF-16 for internal
> > processing (for whatever reason) but that doesn't change the fact that
> > it still would have to parse the incoming UTF-8 (and thus know about
> > UTF-8).
> > 
> Can you see any way to avoid parsing UTF-8 in order to draw the
> cursor? I tried to come up with a way to do that, but even with
> specifying byte strings, I believe that calculating the position of
> the cursor - either in pixels or in glyphs - requires full parsing of
> the input string.

Yes, I don't think it's avoidable either. You just don't have to do
it twice if your text rendering library consumes UTF-8 strings with
byte-offsets though. See my response below.
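
To illustrate the byte-offset variant discussed here: a receiver that is
handed a byte offset can check that it does not fall inside a multi-byte
sequence without decoding any code points. The sketch below is in Python for
illustration only; the function name is hypothetical and is not part of the
protocol or of any Wayland library. It assumes the string is already valid
UTF-8.

```python
def is_code_point_boundary(data: bytes, offset: int) -> bool:
    """Return True if `offset` does not split a UTF-8 encoded code point.

    Assumption: `data` is valid UTF-8. Only one byte is inspected, so no
    code points are decoded.
    """
    if offset < 0 or offset > len(data):
        return False
    if offset == len(data):
        # The end of the string is always a valid cursor position.
        return True
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    return (data[offset] & 0xC0) != 0x80


# "a" is 1 byte, "é" is 2 bytes and "€" is 3 bytes in UTF-8.
text = "aé€".encode("utf-8")
boundaries = [i for i in range(len(text) + 1) if is_code_point_boundary(text, i)]
print(boundaries)
```

This check only guards against splitting a code point; it says nothing about
grapheme boundaries such as ZWJ sequences.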


> > > There's no avoiding the parsing either. What the application cares
> > > about is that the cursor falls between glyphs. The application cannot
> > > know that in all cases. Unicode allows the same sequence to be
> > > displayed in multiple ways (fallback):
> > > 
> > > https://unicode.org/emoji/charts/emoji-zwj-sequences.html
> > > 
> > > One could make an argument that byte offsets should never be close
> > > to ZWJ characters, but I think this decision is better left to the
> > > application, which knows what exactly it is presenting to the user.  
> > 
> > The idea of the previous version of the protocol (from my understanding)
> > was to make sure that only valid UTF-8 and valid byte-offsets (== not
> > falling between bytes of a Unicode code point) into the string will be
> > sent to the client. If you just get a byte-offset into a UTF-8 encoded
> > string you trust the sender to honor the protocol and thus you can just
> > pass the UTF-8 encoded string unprocessed to your text rendering library
> > (provided that the library supports UTF-8 strings which is what I am
> > assuming) without having to parse the UTF-8 string into Unicode code
> > points.
> > 
> > Of course the Unicode code points will have to be parsed at some point
> > if you want to render them. Using byte-offsets just lets you do that at
> > a later stage if your libraries support UTF-8.
> > 
> > 
> Doesn't that chiefly depend on what kind of the text rendering library
> though? As far as I understand, passing text to rendering is necessary
> to calculate the cursor position. At the same time, it doesn't matter
> much for the calculations whether the cursor offset is in bytes or
> code points - the library does the parsing in the last step anyway.
> 
> I think you mean that if the rendering library accepts byte offsets
> as the only format, the application would 

Re: [PATCHv4 wayland-protocols] text-input: Add v3 of the text-input protocol

2018-05-03 Thread Dorota Czaplejewicz
On Thu, 3 May 2018 21:55:40 +0200
Silvan Jegen  wrote:

> On Thu, May 03, 2018 at 09:22:46PM +0200, Dorota Czaplejewicz wrote:
> > On Thu, 3 May 2018 20:47:27 +0200
> > Silvan Jegen  wrote:
> >   
> > > Hi Dorota
> > > 
> > > Some comments and typo fixes below.
> > > 
> > > On Thu, May 03, 2018 at 05:41:21PM +0200, Dorota Czaplejewicz wrote:  
> > > > +  Text is valid UTF-8 encoded, indices and lengths are in code 
> > > > points. If a
> > > > +  grapheme is made up of multiple code points, an index pointing 
> > > > to any of
> > > > +  them should be interpreted as pointing to the first one.
> > > 
> > > That way we make sure we don't put the cursor/anchor between bytes that
> > > belong to the same UTF-8 encoded Unicode code point which is nice. It
> > > also means that the client has to parse all the UTF-8 encoded strings
> > > into Unicode code points up to the desired cursor/anchor position
> > > on each "preedit_string" event. For each "delete_surrounding_text" event
> > > the client has to parse the UTF-8 sequences before and after the cursor
> > > position up to the requested Unicode code point.
> > > 
> > > I feel like we are processing the UTF-8 string already in the
> > > input-method. So I am not sure that we should parse it again on the
> > > client side. Parsing it again would also mean that the client would need
> > > to know about UTF-8 which would be nice to avoid.
> > > 
> > > Thoughts?  
> > 
> > The client needs to know about Unicode, but not necessarily about
> > UTF-8. Specifying code points is actually an advantage here, because
> > byte offsets are inherently expressed relative to UTF-8. By counting
> > with code points, client's internal representation can be UTF-16 or
> > maybe even something else.  
> 
> Maybe I am misunderstanding something but the protocol specifies that
> the strings are valid UTF-8 encoded and the cursor/anchor offsets into
> the strings are specified in Unicode points. To me that indicates that
> the application *has to parse* the UTF-8 string into Unicode points
> when receiving the event otherwise it doesn't know after which Unicode
> point to draw the cursor. Of course the application can then decide to
> convert the UTF-8 string into another encoding like UTF-16 for internal
> processing (for whatever reason) but that doesn't change the fact that
> it still would have to parse the incoming UTF-8 (and thus know about
> UTF-8).
> 
Can you see any way to avoid parsing UTF-8 in order to draw the cursor? I tried 
to come up with a way to do that, but even with specifying byte strings, I 
believe that calculating the position of the cursor - either in pixels or in 
glyphs - requires full parsing of the input string.

> 
> > There's no avoiding the parsing either. What the application cares
> > about is that the cursor falls between glyphs. The application cannot
> > know that in all cases. Unicode allows the same sequence to be
> > displayed in multiple ways (fallback):
> > 
> > https://unicode.org/emoji/charts/emoji-zwj-sequences.html
> > 
> > One could make an argument that byte offsets should never be close
> > to ZWJ characters, but I think this decision is better left to the
> > application, which knows what exactly it is presenting to the user.  
> 
> The idea of the previous version of the protocol (from my understanding)
> was to make sure that only valid UTF-8 and valid byte-offsets (== not
> falling between bytes of a Unicode code point) into the string will be
> sent to the client. If you just get a byte-offset into a UTF-8 encoded
> string you trust the sender to honor the protocol and thus you can just
> pass the UTF-8 encoded string unprocessed to your text rendering library
> (provided that the library supports UTF-8 strings which is what I am
> assuming) without having to parse the UTF-8 string into Unicode code
> points.
> 
> Of course the Unicode code points will have to be parsed at some point
> if you want to render them. Using byte-offsets just lets you do that at
> a later stage if your libraries support UTF-8.
> 
> 
Doesn't that chiefly depend on the kind of text rendering library, though?
As far as I understand, passing the text to the rendering library is necessary
to calculate the cursor position. At the same time, it doesn't matter much for
the calculations whether the cursor offset is in bytes or code points - the
library does the parsing in the last step anyway.

I think you mean that if the rendering library accepts byte offsets as the only 
format, the application would have to parse the UTF-8 unnecessarily. I agree 
with this, but I'm not sure we should optimize for this case. Other libraries 
may support only code points instead.

Did I understand you correctly?

Cheers,
Dorota
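
A minimal sketch of the conversion under discussion, assuming the client
receives a cursor index counted in Unicode code points and needs an offset in
the units its own representation or rendering library uses. Python is used for
illustration only, and the function names are hypothetical, not part of the
protocol. Python strings are indexed by code point, so slicing gives the
prefix directly.

```python
def code_point_index_to_byte_offset(text: str, index: int) -> int:
    """Convert a code point index into a UTF-8 byte offset."""
    if index < 0 or index > len(text):
        raise ValueError("index out of range")
    # Encode only the prefix before the cursor and measure its length.
    return len(text[:index].encode("utf-8"))


def code_point_index_to_utf16_offset(text: str, index: int) -> int:
    """Convert a code point index into an offset in UTF-16 code units."""
    if index < 0 or index > len(text):
        raise ValueError("index out of range")
    # Each UTF-16 code unit is 2 bytes; code points above U+FFFF take two units.
    return len(text[:index].encode("utf-16-le")) // 2


# Four code points: "a" (1 byte), "€" (3 bytes), "😀" (4 bytes), "b" (1 byte).
sample = "a€😀b"
print(code_point_index_to_byte_offset(sample, 3))   # cursor just before "b"
print(code_point_index_to_utf16_offset(sample, 3))
```

Both conversions walk the prefix of the string once, which is the parsing cost
being weighed in this thread.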


___
wayland-devel mailing list
wayland-devel@lists.freedesktop.org

Re: [PATCHv4 wayland-protocols] text-input: Add v3 of the text-input protocol

2018-05-03 Thread Silvan Jegen
On Thu, May 03, 2018 at 09:22:46PM +0200, Dorota Czaplejewicz wrote:
> On Thu, 3 May 2018 20:47:27 +0200
> Silvan Jegen  wrote:
> 
> > Hi Dorota
> > 
> > Some comments and typo fixes below.
> > 
> > On Thu, May 03, 2018 at 05:41:21PM +0200, Dorota Czaplejewicz wrote:
> > > This new protocol description is a simplification over v2.
> > > 
> > > - All pre-edit text styling is gone.
> > > - Pre-edit cursor can span characters.
> > > - No events regarding input panel (OSK) state nor covered rectangle.
> > >   Compositors are still free to handle situations where the keyboard
> > >   focus rectangle is covered by the input panel.
> > > - No set_preferred_language request for clients.
> > > - There is no event to send keysyms. Compositors can use wl_keyboard
> > >   interface instead.
> > > - All state is double-buffered, with specified state.
> > > - Use Unicode codepoints to measure strings.
> > > 
> > > Signed-off-by: Dorota Czaplejewicz 
> > > Signed-off-by: Carlos Garnacho 
> > > ---
> > > This is the next update coming from Purism to perfect the text input 
> > > protocol.
> > > 
> > > The following changes added on top of PATCHv3:
> > > 
> > > - Fixed whitespaces.
> > > - Removed enable flags - the same information can be gathered from
> > > the first requests after enter.
> > > - Changed offsets inside UTF-8 strings to use Unicode character
> > > counts in order to remove the possibility of communicating invalid
> > > state.
> > > - Specified the exact lifetime of double-buffered state, and initial 
> > > values.
> > > - Made changes requested by the IM double-buffered.
> > > 
> > > Some questions remain open. One is: how to specify how much text
> > > to capture in set_surrounding_text, and how often to update?
> > > 
> > > A possible change that I decided against for now is to replace
> > > enable/disable events by create/destroy of a new object, which
> > > would make more state lifetimes encoded in the protocol.
> > > 
> > > After reading a blog post on fcitx [0], I got the impression that
> > > letting the compositor know some persistent ID of a text edit
> > > instance could be useful, however I'm not sure what the use cases
> > > are.
> > > 
> > > As always, I'm happy to hear feedback.
> > > 
> > > Cheers,
> > > Dorota Czaplejewicz
> > > 
> > > [0] 
> > > https://www.csslayer.info/wordpress/fcitx-dev/gaps-between-wayland-and-fcitx-or-all-input-methods/
> > > 
> > >  Makefile.am                                    |   1 +
> > >  unstable/text-input/text-input-unstable-v3.xml | 362 +
> > >  2 files changed, 363 insertions(+)
> > >  create mode 100644 unstable/text-input/text-input-unstable-v3.xml
> > > 
> > > diff --git a/Makefile.am b/Makefile.am
> > > index 4b9a901..86d7ca9 100644
> > > --- a/Makefile.am
> > > +++ b/Makefile.am
> > > @@ -3,6 +3,7 @@ unstable_protocols = \
> > >   unstable/fullscreen-shell/fullscreen-shell-unstable-v1.xml \
> > >   unstable/linux-dmabuf/linux-dmabuf-unstable-v1.xml \
> > >   unstable/text-input/text-input-unstable-v1.xml \
> > > + unstable/text-input/text-input-unstable-v3.xml \
> > >   unstable/input-method/input-method-unstable-v1.xml \
> > >   unstable/xdg-shell/xdg-shell-unstable-v5.xml \
> > >   unstable/xdg-shell/xdg-shell-unstable-v6.xml \
> > > diff --git a/unstable/text-input/text-input-unstable-v3.xml 
> > > b/unstable/text-input/text-input-unstable-v3.xml
> > > new file mode 100644
> > > index 000..ed5204f
> > > --- /dev/null
> > > +++ b/unstable/text-input/text-input-unstable-v3.xml
> > > @@ -0,0 +1,362 @@
> > > +
> > > +
> > > +
> > > +  
> > > +Copyright © 2012, 2013 Intel Corporation
> > > +Copyright © 2015, 2016 Jan Arne Petersen
> > > +Copyright © 2017, 2018 Red Hat, Inc.
> > > +Copyright © 2018 Purism SPC
> > > +
> > > +Permission to use, copy, modify, distribute, and sell this
> > > +software and its documentation for any purpose is hereby granted
> > > +without fee, provided that the above copyright notice appear in
> > > +all copies and that both that copyright notice and this permission
> > > +notice appear in supporting documentation, and that the name of
> > > +the copyright holders not be used in advertising or publicity
> > > +pertaining to distribution of the software without specific,
> > > +written prior permission.  The copyright holders make no
> > > +representations about the suitability of this software for any
> > > +purpose.  It is provided "as is" without express or implied
> > > +warranty.
> > > +
> > > +THE COPYRIGHT HOLDERS DISCLAIM ALL WARRANTIES WITH REGARD TO THIS
> > > +SOFTWARE, 

Re: [PATCHv4 wayland-protocols] text-input: Add v3 of the text-input protocol

2018-05-03 Thread Dorota Czaplejewicz
On Thu, 3 May 2018 20:47:27 +0200
Silvan Jegen  wrote:

> Hi Dorota
> 
> Some comments and typo fixes below.
> 
> On Thu, May 03, 2018 at 05:41:21PM +0200, Dorota Czaplejewicz wrote:
> > This new protocol description is a simplification over v2.
> > 
> > - All pre-edit text styling is gone.
> > - Pre-edit cursor can span characters.
> > - No events regarding input panel (OSK) state nor covered rectangle.
> >   Compositors are still free to handle situations where the keyboard
> >   focus rectangle is covered by the input panel.
> > - No set_preferred_language request for clients.
> > - There is no event to send keysyms. Compositors can use wl_keyboard
> >   interface instead.
> > - All state is double-buffered, with specified state.
> > - Use Unicode codepoints to measure strings.
> > 
> > Signed-off-by: Dorota Czaplejewicz 
> > Signed-off-by: Carlos Garnacho 
> > ---
> > This is the next update coming from Purism to perfect the text input 
> > protocol.
> > 
> > The following changes added on top of PATCHv3:
> > 
> > - Fixed whitespaces.
> > - Removed enable flags - the same information can be gathered from the 
> > first requests after enter.
> > - Changed offsets inside UTF-8 strings to use Unicode character counts in 
> > order to remove the possibility of communicating invalid state.
> > - Specified the exact lifetime of double-buffered state, and initial values.
> > - Made changes requested by the IM double-buffered.
> > 
> > Some questions remain open. One is: how to specify how much text to capture 
> > in set_surrounding_text, and how often to update?
> > 
> > A possible change that I decided against for now is to replace 
> > enable/disable events by create/destroy of a new object, which would make 
> > more state lifetimes encoded in the protocol.
> > 
> > After reading a blog post on fcitx [0], I got the impression that letting 
> > the compositor know some persistent ID of a text edit instance could be 
> > useful, however I'm not sure what the use cases are.
> > 
> > As always, I'm happy to hear feedback.
> > 
> > Cheers,
> > Dorota Czaplejewicz
> > 
> > [0] 
> > https://www.csslayer.info/wordpress/fcitx-dev/gaps-between-wayland-and-fcitx-or-all-input-methods/
> > 
> >  Makefile.am                                    |   1 +
> >  unstable/text-input/text-input-unstable-v3.xml | 362 +
> >  2 files changed, 363 insertions(+)
> >  create mode 100644 unstable/text-input/text-input-unstable-v3.xml
> > 
> > diff --git a/Makefile.am b/Makefile.am
> > index 4b9a901..86d7ca9 100644
> > --- a/Makefile.am
> > +++ b/Makefile.am
> > @@ -3,6 +3,7 @@ unstable_protocols = \
> >   unstable/fullscreen-shell/fullscreen-shell-unstable-v1.xml \
> >   unstable/linux-dmabuf/linux-dmabuf-unstable-v1.xml \
> >   unstable/text-input/text-input-unstable-v1.xml \
> > + unstable/text-input/text-input-unstable-v3.xml \
> >   unstable/input-method/input-method-unstable-v1.xml \
> >   unstable/xdg-shell/xdg-shell-unstable-v5.xml \
> >   unstable/xdg-shell/xdg-shell-unstable-v6.xml \
> > diff --git a/unstable/text-input/text-input-unstable-v3.xml 
> > b/unstable/text-input/text-input-unstable-v3.xml
> > new file mode 100644
> > index 000..ed5204f
> > --- /dev/null
> > +++ b/unstable/text-input/text-input-unstable-v3.xml
> > @@ -0,0 +1,362 @@
> > +
> > +
> > +
> > +  
> > +Copyright © 2012, 2013 Intel Corporation
> > +Copyright © 2015, 2016 Jan Arne Petersen
> > +Copyright © 2017, 2018 Red Hat, Inc.
> > +Copyright © 2018 Purism SPC
> > +
> > +Permission to use, copy, modify, distribute, and sell this
> > +software and its documentation for any purpose is hereby granted
> > +without fee, provided that the above copyright notice appear in
> > +all copies and that both that copyright notice and this permission
> > +notice appear in supporting documentation, and that the name of
> > +the copyright holders not be used in advertising or publicity
> > +pertaining to distribution of the software without specific,
> > +written prior permission.  The copyright holders make no
> > +representations about the suitability of this software for any
> > +purpose.  It is provided "as is" without express or implied
> > +warranty.
> > +
> > +THE COPYRIGHT HOLDERS DISCLAIM ALL WARRANTIES WITH REGARD TO THIS
> > +SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND
> > +FITNESS, IN NO EVENT SHALL THE COPYRIGHT HOLDERS BE LIABLE FOR ANY
> > +SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
> > +WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN
> > +AN ACTION 

Re: [PATCHv4 wayland-protocols] text-input: Add v3 of the text-input protocol

2018-05-03 Thread Silvan Jegen
Hi Dorota

Some comments and typo fixes below.

On Thu, May 03, 2018 at 05:41:21PM +0200, Dorota Czaplejewicz wrote:
> This new protocol description is a simplification over v2.
> 
> - All pre-edit text styling is gone.
> - Pre-edit cursor can span characters.
> - No events regarding input panel (OSK) state nor covered rectangle.
>   Compositors are still free to handle situations where the keyboard
>   focus rectangle is covered by the input panel.
> - No set_preferred_language request for clients.
> - There is no event to send keysyms. Compositors can use wl_keyboard
>   interface instead.
> - All state is double-buffered, with specified state.
> - Use Unicode codepoints to measure strings.
> 
> Signed-off-by: Dorota Czaplejewicz 
> Signed-off-by: Carlos Garnacho 
> ---
> This is the next update coming from Purism to perfect the text input protocol.
> 
> The following changes added on top of PATCHv3:
> 
> - Fixed whitespaces.
> - Removed enable flags - the same information can be gathered from the first 
> requests after enter.
> - Changed offsets inside UTF-8 strings to use Unicode character counts in 
> order to remove the possibility of communicating invalid state.
> - Specified the exact lifetime of double-buffered state, and initial values.
> - Made changes requested by the IM double-buffered.
> 
> Some questions remain open. One is: how to specify how much text to capture 
> in set_surrounding_text, and how often to update?
> 
> A possible change that I decided against for now is to replace enable/disable 
> events by create/destroy of a new object, which would make more state 
> lifetimes encoded in the protocol.
> 
> After reading a blog post on fcitx [0], I got the impression that letting the 
> compositor know some persistent ID of a text edit instance could be useful, 
> however I'm not sure what the use cases are.
> 
> As always, I'm happy to hear feedback.
> 
> Cheers,
> Dorota Czaplejewicz
> 
> [0] 
> https://www.csslayer.info/wordpress/fcitx-dev/gaps-between-wayland-and-fcitx-or-all-input-methods/
> 
>  Makefile.am                                    |   1 +
>  unstable/text-input/text-input-unstable-v3.xml | 362 +
>  2 files changed, 363 insertions(+)
>  create mode 100644 unstable/text-input/text-input-unstable-v3.xml
> 
> diff --git a/Makefile.am b/Makefile.am
> index 4b9a901..86d7ca9 100644
> --- a/Makefile.am
> +++ b/Makefile.am
> @@ -3,6 +3,7 @@ unstable_protocols = \
>   unstable/fullscreen-shell/fullscreen-shell-unstable-v1.xml \
>   unstable/linux-dmabuf/linux-dmabuf-unstable-v1.xml \
>   unstable/text-input/text-input-unstable-v1.xml \
> + unstable/text-input/text-input-unstable-v3.xml \
>   unstable/input-method/input-method-unstable-v1.xml \
>   unstable/xdg-shell/xdg-shell-unstable-v5.xml \
>   unstable/xdg-shell/xdg-shell-unstable-v6.xml \
> diff --git a/unstable/text-input/text-input-unstable-v3.xml 
> b/unstable/text-input/text-input-unstable-v3.xml
> new file mode 100644
> index 000..ed5204f
> --- /dev/null
> +++ b/unstable/text-input/text-input-unstable-v3.xml
> @@ -0,0 +1,362 @@
> +
> +
> +
> +  
> +Copyright © 2012, 2013 Intel Corporation
> +Copyright © 2015, 2016 Jan Arne Petersen
> +Copyright © 2017, 2018 Red Hat, Inc.
> +Copyright © 2018 Purism SPC
> +
> +Permission to use, copy, modify, distribute, and sell this
> +software and its documentation for any purpose is hereby granted
> +without fee, provided that the above copyright notice appear in
> +all copies and that both that copyright notice and this permission
> +notice appear in supporting documentation, and that the name of
> +the copyright holders not be used in advertising or publicity
> +pertaining to distribution of the software without specific,
> +written prior permission.  The copyright holders make no
> +representations about the suitability of this software for any
> +purpose.  It is provided "as is" without express or implied
> +warranty.
> +
> +THE COPYRIGHT HOLDERS DISCLAIM ALL WARRANTIES WITH REGARD TO THIS
> +SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND
> +FITNESS, IN NO EVENT SHALL THE COPYRIGHT HOLDERS BE LIABLE FOR ANY
> +SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
> +WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN
> +AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION,
> +ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF
> +THIS SOFTWARE.
> +  
> +
> +  
> +
> +  The zwp_text_input_v3 interface represents text input and input methods
> +  associated with a seat. It provides