Hi Goffi,

See inline comments. Sorry for the wall of text, and apologies if this overlaps with one of the mails you wrote since I started writing it.
On Mon, 2024-05-20 at 16:51 +0200, Goffi wrote:
> There are many benefits to using CBOR:
>
> - It is smaller. While individual pieces of data may be tiny, the
> cumulative amount is significant, and efficiency is crucial.

The cumulative size reduction is about 10-20% [1]. That isn't really a huge improvement, and almost all events will fit into a single network-layer frame anyway, further reducing the impact of encoding size.

> - Segmentation is inherent in CBOR, so you always know if you have
> all the data. This is beneficial for optimization and security.

Segmentation is also inherent to SCTP, the protocol WebRTC data channels use to transfer content frames. There is no win in segmenting the same segments twice.

> - Encoding and decoding CBOR are much more efficient, essential for
> quick and efficient data processing, especially by low-resource
> devices (like Arduinos).

Not untrue, but probably negligible given the resource use of IP, UDP, DTLS and SCTP - all part of the protocol stack you're building on and thus involved in every event to be processed. DTLS encryption in particular is going to be far more resource-hungry than the difference between a CBOR parser and a JSON parser. And notably, CBOR encoding is not a native function in web browsers, so if the web is a goal of this thing (and seemingly it is, given all the references to web tech in the XEP), CBOR is probably not much better than JSON.

> > - If we define a protocol for remote control, I would prefer this
> > to be a <message>-based protocol that can be used either using a
> > traditional XMPP connection or via XEP-0247 Jingle XML Streams.
>
> Using server-based <message> would be highly inefficient. Why send
> gamepad data to the server, incurring delays and extra processing,
> when you can send it directly from your local network?

XEP-0247 Jingle XML Streams doesn't need to go via the server; it uses Jingle just like your proposed protocol. While the XEP hasn't been maintained for some time and makes weird references to other XEPs, nothing in it forbids using it with WebRTC data channels. In fact this has been discussed as a useful tool for all kinds of things recently (like initial device crypto setup or device-to-device MAM).

And of course latency when sending via a server might be sub-perfect, but it's very similar to the latency you would see if the network environment requires using a TURN server, which is one of the ways to use Jingle. And as mentioned, there are valid use cases for input where low latency is not that crucial. Think of keyboard input to a remote shell - essentially what SSH does - which is not uncommonly routed through proxies/tunnels that add latency. Of course for game input, drawing and 3D modeling, that's probably not an option. It depends a lot on the use case, and that's why flexibility is very much a good idea. Building something that is exclusively/primarily designed around having a web browser XMPP client connected via Jingle WebRTC data channels doesn't sound like flexibility was part of the design.

> Regarding direct XML streams, CBOR is still more efficient.
> Additionally, the protocol is based on web APIs, and CBOR provides a
> direct mapping. Using XML would require reinventing the wheel.

Just as you can "directly" map data from JSON objects in a web browser to CBOR, you can directly map them to XML. It's not really a good idea to do such a direct mapping in either case though: e.g. if you used enumerated keys in CBOR instead of a string map, you could drastically reduce the payload size and improve parsing speed.
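To illustrate both points - the modest savings of a direct mapping and the larger savings of enumerated keys - here is a rough TypeScript sketch assuming the cbor-x npm package; the event shape and the numeric event id are made up for illustration, not taken from the XEP:

    import { encode } from 'cbor-x';

    // Made-up mouse-move event, shaped as a direct mapping would produce it.
    const event = { type: 'mousemove', x: 0.51172, y: 0.24609 };

    const jsonBytes = new TextEncoder().encode(JSON.stringify(event)).length;
    const cborBytes = encode(event).length;                   // direct mapping, string keys
    const packedBytes = encode([1, event.x, event.y]).length; // 1 = hypothetical event id

    console.log(jsonBytes, cborBytes, packedBytes);

The string-keyed CBOR form only saves a handful of bytes over JSON; dropping the string keys is where the real savings are - at the cost of no longer being a "direct mapping" of the Web API objects.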
> The protocol described here is for input sending and potentially
> other features like clipboard sharing, gamepad, and haptic feedback.
> In combination with existing specifications, one use case can be
> remote desktop. The goal is to reuse existing XMPP building blocks to
> simplify implementation. That's what XMPP is for: coordinating
> specifications.

As I mentioned in another email: if you really feel like using RTP for screen content transfer, you can always decide to use only the RFB protocol (or something else) for the input part. I brought it up as an example of an existing protocol that (among other features) has logic for remote control input. Using RFB for screen transfer may be an adjacent topic, but it is not a requirement.

> We already have an A/V transmission protocol. With WebRTC, it's
> extremely efficient regarding latency and bandwidth. It's suitable
> for remote desktop streaming, including robust network traversal
> mechanisms.

Network traversal is on a completely different layer than the protocol used to transfer screen content (RTP vs. RFB). Nothing prevents running the RFB protocol over WebRTC data channels. Running RFB over WebSockets in web browsers is also not well specified anywhere, but is still widely deployed [2].

> And, as mentioned, the protocol comes from Web APIs because they are
> simple, well-documented, and provides a well-thought-out abstraction
> of the hardware.

Web APIs are designed around what browsers can reasonably do on the machines they run on. That doesn't mean they are well thought out for generic use. I just played with https://w3c.github.io/uievents/tools/key-event-viewer.html and it's still unclear to me, when pressing modifier keys, which events are emitted when and what the supposed state of the modifier flags is for those events. I found that the behavior is inconsistent between browsers (and probably operating systems) and also between different keys in the same browser. I bet this is not intended, but as the specification and MDN don't really tell me what the correct behavior would be, I can't really blame the browsers either.

What I learned is that, as a web developer, you must be prepared to see modifier flags set without a keydown event ever being emitted for the corresponding modifier key, and also keyup events being emitted without a corresponding keydown indicating the key was in fact pressed. So I definitely don't agree that something must be well-documented and well-thought-out just because it comes from the Web...
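For what it's worth, the most robust sender-side strategy I know of is to re-read the full modifier state from every event via getModifierState() instead of trusting matched keydown/keyup pairs for the modifier keys themselves. A sketch (sendRemoteKeyEvent is a made-up placeholder, not from the XEP):

    // Placeholder for whatever actually serializes and sends the event.
    declare function sendRemoteKeyEvent(ev: object): void;

    const MODIFIERS = ['Shift', 'Control', 'Alt', 'Meta'];

    function snapshotModifiers(ev: KeyboardEvent): Record<string, boolean> {
      const state: Record<string, boolean> = {};
      for (const mod of MODIFIERS) {
        state[mod] = ev.getModifierState(mod);
      }
      return state;
    }

    document.addEventListener('keydown', (ev) => {
      // Ship the full modifier snapshot with every event, so the
      // receiver never depends on having seen the modifier's own keydown.
      sendRemoteKeyEvent({ key: ev.key, code: ev.code, down: true,
                           modifiers: snapshotModifiers(ev) });
    });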
> Low latency is crucial for inputs, especially for devices like
> gamepads, touchpads, and mice. Even with keyboards, low latency can
> be important, for instance, when playing a game.

I'm not saying there aren't any cases where low latency is important; where I disagree is that this is the case on all occasions. If you don't have low-latency feedback from the remote device, low latency for input is very likely not crucial.

Anyway, I remain unconvinced that the XSF is the place to specify a remote control protocol from scratch (which is what sections 8 and 9 of the XEP are about), mostly because I feel the XSF does not have the competence to do so (i.e. we will probably do things terribly wrong, due to lack of experience in the field). That doesn't mean we don't need /something/ in XMPP to do the signaling for whatever is used to send remote control events. And using Jingle for this (be it WebRTC data channels or any other Jingle transport) totally makes sense for low latency.

--

There are a bunch of things I would suggest that are not related to the above at all.

Instead of `<device type="keyboard"/>` I would go with `<keyboard/>`, allowing attributes to be added for more information where they fit (e.g. for a mouse, an optional "buttons" attribute with the number of buttons on the mouse; for a gamepad, you might want to provide the layout; etc.). This also means that to add new devices outside this specification, one can just use `<gamepad xmlns="urn:xmpp:remote-control:gamepad:0"/>` or similar. As a general guideline, I feel attributes should only be used if the set of possible values is finite.

I would strongly opt to not make the use of data channels a SHOULD in this protocol. It really doesn't matter for the purpose of this protocol, and you don't want to have to upgrade it if a new transport protocol becomes available that would be a better fit. Jingle already abstracts over streaming vs. datagram transports, so that application protocols don't need to deal with it.

There is a lot of specification of interaction with the Jingle RTP and WebRTC protocols. This seems mostly unnecessary:

- You already write in the requirements that everything should work even without Jingle RTP.
- You state that one MUST use the same "WebRTC session" (what is that even?) for both Jingle RTP and Remote Control. I wouldn't know why this is. Of course, reusing existing sessions in Jingle often makes sense (that's why it's a feature), but it definitely doesn't need a MUST here.
- You write explicitly that Remote Control can be added with content-add to existing Jingle RTP sessions. This is already given by the Jingle specification, which doesn't limit what content can be added to a session (e.g. you can also add a file transfer to an existing call).
- You say that touch devices should not be used when no video RTP session is active. I don't see why this shouldn't be possible. I own a drawing tablet that doesn't have a screen but still is an absolute pointing device (aka "touch"). If that device were connected via XMPP, it wouldn't need an RTP session to transfer its input.
- You say that absolute mouse events should not be used when no video RTP session is active. I also don't see why this restriction is in place - same as above.

For both touch and mouse you use x and y coordinates "relative to the video stream". What does that mean? x and y are doubles, so are they supposed to be relative to the screen, so that only values between 0 and 1 (inclusive) are valid? If x and y are absolute values in pixels, why are they doubles? If they are pixel values, are they pixels of the screen or pixels of the video (as the video might use a lower resolution than the actual screen)? I would suggest going with relative values 0-1. If you want to use an absolute value in pixels, I suggest making it screen pixels and signaling the screen dimensions somewhere outside of and independent of the RTP video resolution.
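A minimal sender-side sketch of the relative variant, normalizing against whatever local element renders the remote video (the helper name is made up):

    // Normalize pointer positions to [0, 1] against the element showing
    // the remote video, independent of video and screen resolution.
    function normalizedPointer(ev: MouseEvent, video: HTMLVideoElement) {
      const rect = video.getBoundingClientRect();
      const x = Math.min(Math.max((ev.clientX - rect.left) / rect.width, 0), 1);
      const y = Math.min(Math.max((ev.clientY - rect.top) / rect.height, 0), 1);
      return { x, y }; // the receiver multiplies by its own screen size
    }

This way the receiver maps the values back to its own screen pixels, which also sidesteps the video-vs.-screen resolution ambiguity.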
Wheel events don't have a screen coordinate. I'm pretty sure they should have one, as the cursor position at the time of the scroll matters a lot.

If I understood correctly, you specify that a session is a screen-share session by adding a remote control content without any device. This remote control content would thus effectively not be used, but would still require setting up a data channel. This doesn't seem like good protocol design. The fact that a video is a screen share should be communicated outside this specification, and this specification should not be involved at all in such a case (as it's not remote control). A remote control content without devices should be invalid.

Marvin

[1] https://gist.github.com/mar-v-in/003bedfcafb9e49a6ba6083ae374088b
[2] https://github.com/novnc/noVNC/wiki/Projects-and-companies-using-noVNC
