Hi Marvin,

Le lundi 20 mai 2024, 22:48:42 UTC+2 Marvin W a écrit :
> Hi Goffi,
> 
> See inline comments. Sorry for the wall of text and if it overlaps with
> one of the mails you wrote since I started writing this.
> 
> On Mon, 2024-05-20 at 16:51 +0200, Goffi wrote:
> > There are many benefits to using CBOR:
> > 
> [SNIP]
> The cumulative amount is about 10-20% [1]. This isn't really a huge
> improvement and almost all events will fit into a single network layer
> frame anyway, further reducing the impact of encoding size.
> 
> [SNIP]
> 
> Segmentation is also inherent to SCTP, the protocol webrtc data
> channels use to transfer content frames. There is no win in segmenting
> the same segments twice.

Note that while recommended, WebRTC Data Channel is not mandatory, and any 
streaming transport may be used. Your arguments are only valid for WebRTC Data 
Channels.

> 
> > - Encoding and decoding CBOR are much more efficient, essential for
> > quick and 
> > efficient data processing, especially by low-resource devices (like
> > Arduinos).
> 
> Not untrue, but probably negligible given the resource use of IP, UDP,
> DTLS, SCTP - all part of the protocol stack you're building on and thus
> involved in every event to be processed. Especially DTLS encryption is
> going to be much more resource hungry than the difference between CBOR
> parser and JSON parser. And notable, CBOR encoding is not a native
> function in web browsers, so if web is a goal of this thing (and
> seemingly it is, given all the references to web tech in the XEP), CBOR
> is probably not much better than JSON.

Working with the web is a goal, but it should of course also work outside the 
web (I currently have a web implementation for the controlling device, and CLI 
ones for a basic controlling device and for the controlled device).

CBOR is not native, but there are many implementations available.

Anyway, I'm not hard set on CBOR. If the consensus is to get rid of it, we can 
get rid of it.
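
To make the size point concrete, here is a rough Python sketch (a hand-rolled 
short-form encoding following RFC 8949, since the stdlib has no CBOR encoder; 
the event payload is hypothetical). On this toy payload the saving is around 
29%; on more varied real payloads it tends to land closer to the 10-20% quoted 
above:

```python
import json

# Minimal sketch (NOT a full CBOR encoder): short-form encoding of a small
# string-keyed map of short text values, per RFC 8949 major types 3 (text
# string) and 5 (map), just to compare sizes against compact JSON.
def cbor_tstr(s: str) -> bytes:
    b = s.encode("utf-8")
    assert len(b) < 24  # short form only, enough for this sketch
    return bytes([0x60 | len(b)]) + b

def cbor_map(d: dict) -> bytes:
    assert len(d) < 24
    out = bytes([0xA0 | len(d)])
    for k, v in d.items():
        out += cbor_tstr(k) + cbor_tstr(v)
    return out

event = {"type": "keydown", "key": "a"}  # hypothetical event payload
json_size = len(json.dumps(event, separators=(",", ":")).encode("utf-8"))
cbor_size = len(cbor_map(event))
print(json_size, cbor_size)  # 28 vs 20 bytes for this toy payload
```

Most of the gain comes from CBOR's one-byte map and string headers replacing 
JSON's quotes, braces, colons and commas.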

Regarding the choice of the Web API, it's only because sending events, 
especially keyboard events, is hard to do well. There are many different ways 
to encode them depending on the platform, and various kinds of keyboards with 
special characters. The Web API is simple, documented, and abstracts this 
complexity away. The web has been around for 35 years; it has already gone 
through the rough patches. But again, I'm not against switching if there is 
something as simple and complete.


> 
> [SNIP]
> 
> XEP-0247 Jingle XML streams doesn't need to go via the server, it uses
> Jingle just like your proposed protocol.

I know that; I've just ruled out using <message> through the server, as was 
proposed in other feedback.

> While the XEP isn't maintained
> for some time and makes weird references to other XEPs, nothing in it
> forbids using it with webrtc data channels. In fact this has been
> discussed as a useful tool for all kinds of things recently (like
> initial device crypto setup or device-to-device MAM).

In general I love the idea of XEP-0247 for many use cases. I just feel that 
XML is not well suited to this particular use case.
 
> And of course latency when sending via a server might be sub-perfect,
> but it's a very similar latency you would see if the network
> environment requires to use a TURN server, which is one of the ways to
> use Jingle.

TURN relay is a worst-case scenario. And even then, it's more efficient, 
because you don't have to wait for server queue handling and <message> 
processing.

> And as mentioned, there are valid use cases for having
> input in cases where low latency is not that crucial. Think of keyboard
> input to a remote shell - essentially what SSH does - which is not
> uncommon to be routed through proxies/tunnels that add latency. Of
> course for game input, drawing and 3d modeling, that's probably not an
> option. It depends a lot on the usecase and that's why flexibility is
> very much a good idea. Building something that is exclusively/primarily
> designed around having a web browser XMPP client connected via Jingle
> webrtc datachannels doesn't sound like flexibility was part of the
> design.

It is not designed around having a web browser at all! Being inspired by a Web 
API doesn't make it so; otherwise HTTP Upload would be designed for web 
browsers too. The fact is that an enormous amount of engineering has gone, and 
still goes, into web technologies, and many good things have emerged from 
there, like WebRTC, WebSockets, WebAssembly, etc.

And again, I have a non web implementation already (and a web one).

Sure, with SSH latency is less of a problem (though still annoying), but the 
current mechanism works in all cases and is simple and efficient, whereas 
adding another mechanism would add complexity just because "there are valid 
use cases for having input in cases where low latency is not that crucial".

> Just as you can "directly" map data from JSON objects from a web
> browser to CBOR, you can directly map them to XML. It's not really a
> good idea to do such a direct mapping in both cases though (e.g. if you
> used enumerated keys in CBOR instead of a string map, you can
> drastically reduce the payload size and improve parsing speed).

To have a successful specification, there is a balance to strike between 
efficiency, ease of implementation, and flexibility. I believe that string 
maps and selective data mapping achieve that balance.
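
As an illustration of that trade-off, here is a back-of-the-envelope Python 
sketch (the 0/1 key enumeration is hypothetical, not part of my proposal) of 
what enumerated keys would save per event, using the same CBOR short-form 
framing:

```python
# Hypothetical key enumeration (0 = "type", 1 = "key") replacing the string
# map: a small-int key costs 1 byte, a short text key costs 1 + len bytes.
string_keys = ["type", "key"]
enum_cost = len(string_keys) * 1                                  # one byte per small-int key
str_cost = sum(1 + len(k.encode("utf-8")) for k in string_keys)   # header + key bytes
saved = str_cost - enum_cost
print(saved)  # bytes saved per event in this sketch
```

A few bytes per event, at the cost of every implementer juggling a 
number-to-name table; that is the kind of balance I mean.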
 
> [SNIP]
> 
> As I mentioned in another email: If you really feel like using RTP for
> screen content transfer, you can always decide to only use the RFB
> protocol (or something else) for the input part. I took it as an
> example for an existing protocol that (among other features) has logic
> for remote control input.

Again I'm not hard set on chosen technologies.

I'm not familiar with the internals of RFB and will look at it. If it's a good 
fit, I'm not against replacing the current events wire format with it.

From a quick glance at the Wikipedia page, I see "In terms of transferring 
clipboard data, "there is currently no way to transfer text outside the 
Latin-1 character set".[5] A common pseudo-encoding extension solves the 
problem by using UTF-8 in an extended format.[2]: § 7.7.27 ", which makes me 
suspicious though.

One of the design goals of my proposal is to have something really simple and 
straightforward to implement.


> 
> Using RFB for screen transfer may be an adjacent topic, but not a
> requirement.

The discussed specification focuses on remote controlling a device, rather than 
screen/audio transfer. It explains how to use it in conjunction with the 
current specification for A/V calls for remote desktop, but designing the 
desktop transfer protocol is out of scope.

Another XEP may be specified if XEP-0167 proves insufficient for desktop 
transfer, and this proposal will be usable with it without issue. Such a XEP 
could use RFB, SPICE, or whatever.


> 
> [SNIP]
> 
> I just played with the
> https://w3c.github.io/uievents/tools/key-event-viewer.html and it's
> still unclear to me when pressing modifier keys, which events are
> emitted when and what is the supposed state of the modifier flag for
> those events. I figured that the behavior is inconsistent between
> browsers (and probably operating systems) and also between different
> keys in the same browser. I bet this is not intended, but as the
> specification and MDN don't really tell me what the correct behavior
> would be, I can't really blame the browsers either.

There is no modifier flag used in the specification; there are the key value 
and the location number. From my tests, it's consistent and corresponds to the 
documentation in the browsers I've tried (Firefox and Chromium).
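
For reference, this is roughly what a keyboard payload looks like in that 
model (an illustrative Python dict, not a normative wire format; the location 
constants are the ones defined by the UI Events spec for KeyboardEvent):

```python
# DOM KeyboardEvent location constants, as defined by the UI Events spec:
DOM_KEY_LOCATION_STANDARD = 0
DOM_KEY_LOCATION_LEFT = 1
DOM_KEY_LOCATION_RIGHT = 2
DOM_KEY_LOCATION_NUMPAD = 3

# Illustrative payload: pressing the left Shift key is fully identified by
# the "key" value plus the location number -- no separate modifier flag.
event = {"type": "keydown", "key": "Shift", "location": DOM_KEY_LOCATION_LEFT}
print(event["key"], event["location"])
```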

> I'm not saying there aren't any cases where low-latency is important,
> where I disagree is that this is the case in all occasions. If you
> don't have low latency feedback from the remote device, low latency for
> input is very likely not crucial.

I have the feeling that you only see this specification from the remote 
desktop point of view. There are other use cases, and another major one is 
using a device as input for another one in the same physical location: a 
smartphone as an ad-hoc touchpad or gamepad, for instance. And if low latency 
is easily achieved, I still don't see the point of having another mechanism 
just because in some niche cases latency is not that annoying (but it still 
is; it's always annoying).

> 
> Anyway, I remain not convinced that XSF is the place to specify a
> remote control protocol from scratch (which is what sections 8 and 9 of
> the XEP are about). Mostly because I feel the XSF does not have the
> competence for doing so (aka. we will probably do things terribly
> wrong, due to lack of experience in the field).

Again, it is not from scratch. It's re-using existing protocols, in a simple, 
working, easy-to-implement, and efficient way.

Thank you for your feedback; as for the rest of your message, I'll take it 
into account for the next revision if the protoXEP is accepted.

> Instead of `<device type="keyboard"/>` I would go with `<keyboard />`,
> allowing for attributes to be added for more information where there is
> fit (e.g. for a mouse have an optional buttons attribute with the
> number of buttons that are on the mouse, or for a gamepad, you might
> want to provide the layout, etc). This also means that to extend new
> devices outside this specification, one can just have a `<gamepad
> xmlns="urn:xmpp:remote-control:gamepad:0" />` or similar. As a general
> guideline, I feel attributes should only be used if the set of possible
> is finite.

The specification says that other child elements can be used in <device> for 
parameters. But your proposition may be cleaner; I'll consider it for a next 
revision if the protoXEP is accepted. Thanks!

> 
> I would strongly opt to not make the use of datachannels a SHOULD in
> this protocol. It really doesn't matter for the purpose of this
> protocol and you don't want to need to upgrade this protocol if a new
> transport protocol becomes available that would be a better fit. Jingle
> does the abstraction to streaming vs datagram, so that application
> protocols don't need to deal with it.

The goal here is to be sure that it will work with web clients, as data 
channels are currently the only way to have a direct connection with browsers. 
I can reformulate to only suggest it and get rid of the SHOULD.

> 
> There is a lot of specification for interaction with the Jingle RTP and
> WebRTC protocols. This seems mostly unnecessary.
> - You already write in the requirements that everything should work
> even without Jingle RTP
> - You put that one MUST use the same "WebRTC session" (what is that
> even) for both Jingle RTP and Remote Control. I wouldn't know why this
> is. Of course using existing sessions in Jingle often makes sense
> (that's why it's a feature), but it definitely doesn't need a MUST
> here.

WebRTC has sessions pretty much like Jingle's; the session ID is what you have 
in the o= line of your SDP.
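
Concretely, the session ID is the second field of that line. A minimal Python 
sketch (the SDP line itself is a made-up example):

```python
# The WebRTC session identifier is the <sess-id> field of the SDP "o=" line:
# o=<username> <sess-id> <sess-version> <nettype> <addrtype> <unicast-address>
sdp_o_line = "o=- 4611731400430051336 2 IN IP4 127.0.0.1"  # hypothetical SDP line
sess_id = sdp_o_line.split()[1]
print(sess_id)
```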

The goal here is to reuse the connection, and to know which streams are used 
for what. However, this is not ideal, I agree. I have a plan to get rid of 
this section and work on a separate specification to add metadata to 
distinguish which streams are used for what.

> - You write explicitly that Remote Control can be added with content-
> add to existing Jingle RTP sessions. This is already given by the
> Jingle specification, which doesn't limit what content can be added to
> a session (e.g. you can also add a file transfer to an existing call).


> - You say that touch devices should not be used when no video RTP
> session is active. I don't see why this shouldn't be possible. I do own
> a drawing tablet that doesn't have a screen but still is an absolute
> pointing device (aka "touch"). If that device was connected via XMPP,
> it wouldn't need a RTP session to transfer its input.

The issue is that the video feed is used in this case to get the screen 
dimensions. Without it, we can't handle touch events, which use absolute 
positions (while for the mouse there is a relative position mode for exactly 
this use case).

An alternative would be to specify the screen dimensions when establishing the 
remote control session.
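
That alternative could look like this (a Python sketch with made-up negotiated 
dimensions, just to show how absolute touch coordinates could be checked 
without any video stream):

```python
# Hypothetical dimensions negotiated at session setup, instead of being
# derived from a video stream.
screen_w, screen_h = 1920.0, 1080.0

def touch_in_bounds(x: float, y: float) -> bool:
    # Validate an absolute touch coordinate against the negotiated screen size.
    return 0.0 <= x <= screen_w and 0.0 <= y <= screen_h

print(touch_in_bounds(800.0, 600.0), touch_in_bounds(2000.0, 600.0))
```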


> - You say that absolute mouse events should not be used when no video
> RTP session is active. I also don't see why this restriction is in
> place - same as above.
> 
> For both touch and mouse you use x and y coordinates "relative to the
> video stream". What does that mean? x and y are doubles, so are they
> supposed to be relative to the screen, so only values between 0 and 1
> (inclusive) are valid?

No, its value is in pixels, the same as for the Web API. It's a double because 
pixels can be subdivided (High-DPI displays, transformations). I realize that, 
besides the link to MDN, this is not explicitly stated; I'll add a notice in 
future revisions.

The Web API initially used int, then moved to double. That's the kind of 
reason why I'm mapping to the Web API: they have already been down that road, 
and the types are carefully chosen.
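
A quick illustration of why an int doesn't cut it (illustrative values):

```python
# With a non-integer device pixel ratio, mapping a physical pixel position
# back to CSS pixels yields a fractional coordinate.
device_pixel_ratio = 1.5   # illustrative High-DPI scale factor
device_x = 500             # physical pixel position
css_x = device_x / device_pixel_ratio
print(css_x)  # fractional -- an int type would silently lose precision
```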

> [SNIP]
> 
> Wheel events don't have a screen coordinate. I'm pretty sure they
> should have those, as the cursor position for the movement does matter
> a lot.

Cursor position is handled by other devices (mouse or touch). A wheel by 
itself doesn't have any position (it can be an independent device not linked 
to a mouse).
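
For comparison, this is the shape of a wheel payload in the Web model 
(illustrative Python dict; the deltaMode constants are the DOM WheelEvent 
ones): deltas, but no coordinates:

```python
# DOM WheelEvent deltaMode constants, as defined by the UI Events spec:
DOM_DELTA_PIXEL = 0
DOM_DELTA_LINE = 1
DOM_DELTA_PAGE = 2

# Illustrative wheel payload: scroll deltas and a delta mode, but no x/y,
# matching the point above that a wheel is position-less on its own.
wheel = {"type": "wheel", "deltaX": 0.0, "deltaY": -53.0, "deltaMode": DOM_DELTA_PIXEL}
print("x" in wheel, "y" in wheel)
```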

> 
> If I understood correctly, you specify that a session is a screen share
> session by adding a remote control content without any device. This
> remote control content would thus effectively not be used, but still
> require setup of a data channel. This doesn't seem like a good
> protocol.
> The fact that a video is a screen share should be
> communicated outside this specification and this specification should
> not be involved at all in such a case (as it's not a remote control). A
> remote control without devices should be invalid.

It was just to handle the case where no device is accepted; there were 2 
options:
- reject it entirely
- say it's a simple screen-sharing session.

I chose the latter. But indeed, the data channel is then useless. I can change 
it to the other option.



Thanks for the time you took to review the spec and write this feedback.

As a summary:

- I'm not hard set on technologies, and I'm OK with getting rid of CBOR if 
there is consensus on it. I personally still think that it's a superior 
solution.

- regarding using RFB for input events only, I'll have a deeper look at the 
spec and evaluate it. It may be an option if it is comparable to the current 
proposal in ease of implementation, efficiency, and flexibility.

- I will take other feedback into account for a future revision.

Thanks!


Best,
Goffi


_______________________________________________
Standards mailing list -- [email protected]
To unsubscribe send an email to [email protected]
