Hi Marvin,

On Monday 20 May 2024 at 22:48:42 UTC+2, Marvin W wrote:

> Hi Goffi,
>
> See inline comments. Sorry for the wall of text and if it overlaps with
> one of the mails you wrote since I started writing this.
>
> On Mon, 2024-05-20 at 16:51 +0200, Goffi wrote:
> > There are many benefits to using CBOR:
>
> [SNIP]
>
> The cumulative amount is about 10-20% [1]. This isn't really a huge
> improvement, and almost all events will fit into a single network layer
> frame anyway, further reducing the impact of encoding size.
>
> [SNIP]
>
> Segmentation is also inherent to SCTP, the protocol WebRTC data
> channels use to transfer content frames. There is no win in segmenting
> the same segments twice.
Note that while recommended, WebRTC Data Channel is not mandatory; any streaming transport may be used. Your arguments are only valid for WebRTC Data Channels.

> > - Encoding and decoding CBOR are much more efficient, essential for
> > quick and efficient data processing, especially by low-resource
> > devices (like Arduinos).
>
> Not untrue, but probably negligible given the resource use of IP, UDP,
> DTLS, SCTP - all part of the protocol stack you're building on and thus
> involved in every event to be processed. Especially DTLS encryption is
> going to be much more resource hungry than the difference between CBOR
> parser and JSON parser. And notably, CBOR encoding is not a native
> function in web browsers, so if web is a goal of this thing (and
> seemingly it is, given all the references to web tech in the XEP), CBOR
> is probably not much better than JSON.

Working with the Web is a goal, but it should of course work outside the Web too (I currently have a web implementation for the controlling device, and CLI ones for a basic controlling device and for the controlled device). CBOR is not native, but there are many implementations available. Anyway, I'm not hard set on CBOR: if the consensus is to get rid of it, we can get rid of it.

Regarding the choice of the Web API, it's only because sending events, especially keyboard events, is hard to do well. There are many different ways to encode them depending on the platform, and various kinds of keyboards with special characters. The Web API is simple, documented, and abstracts away this complexity. The Web has been around for 35 years; it has already gone through the rough patches. But again, I'm not against switching if there is something as simple and complete.

> [SNIP]
>
> XEP-0247 Jingle XML streams doesn't need to go via the server, it uses
> Jingle just like your proposed protocol.

I know that; I've just ruled out using <message> through the server, as has been proposed in another feedback.
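To make the encoding-size comparison from earlier in this thread concrete, here is a rough, self-contained sketch. The tiny CBOR encoder below only covers the few types an input event needs (RFC 8949 major types for small maps, short strings, small ints, doubles), and the event shape is illustrative, not the exact wire format of the proposal:

```python
import json
import struct

def cbor_encode(obj):
    """Encode a tiny subset of CBOR (RFC 8949): booleans, small
    unsigned ints, 64-bit floats, short text strings and small maps."""
    if isinstance(obj, bool):
        return b"\xf5" if obj else b"\xf4"
    if isinstance(obj, int) and 0 <= obj < 24:
        return bytes([obj])                      # major type 0, value inline
    if isinstance(obj, int) and 0 <= obj < 256:
        return bytes([0x18, obj])                # uint8 follows
    if isinstance(obj, float):
        return b"\xfb" + struct.pack(">d", obj)  # IEEE 754 double
    if isinstance(obj, str):
        data = obj.encode("utf-8")
        if len(data) >= 24:
            raise ValueError("sketch only handles short strings")
        return bytes([0x60 | len(data)]) + data  # major type 3, short text
    if isinstance(obj, dict):
        if len(obj) >= 24:
            raise ValueError("sketch only handles small maps")
        out = bytearray([0xA0 | len(obj)])       # major type 5, small map
        for key, value in obj.items():
            out += cbor_encode(key) + cbor_encode(value)
        return bytes(out)
    raise TypeError(f"unsupported type: {type(obj)!r}")

# An event roughly shaped like a pointer move (illustrative only).
event = {"type": "mousemove", "x": 640.5, "y": 360.25}
cbor_size = len(cbor_encode(event))
json_size = len(json.dumps(event, separators=(",", ":")).encode("utf-8"))
print(cbor_size, json_size)
```

For this float-heavy event the gap is small, which is consistent with the 10-20% figure quoted above; integer-heavy payloads (or enumerated keys instead of a string map) save proportionally more.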
> While the XEP isn't maintained for some time and makes weird references
> to other XEPs, nothing in it forbids using it with WebRTC data
> channels. In fact this has been discussed as a useful tool for all
> kinds of things recently (like initial device crypto setup or
> device-to-device MAM).

In general I love the idea of XEP-0247 for many use cases. I just feel that XML is not well suited to this particular one.

> And of course latency when sending via a server might be sub-perfect,
> but it's a very similar latency you would see if the network
> environment requires to use a TURN server, which is one of the ways to
> use Jingle.

TURN relay is a worst-case scenario. And even then, it's more efficient, because you don't have to wait for server queue handling and <message> processing.

> And as mentioned, there are valid use cases for having input in cases
> where low latency is not that crucial. Think of keyboard input to a
> remote shell - essentially what SSH does - which is not uncommon to be
> routed through proxies/tunnels that add latency. Of course for game
> input, drawing and 3d modeling, that's probably not an option. It
> depends a lot on the use case and that's why flexibility is very much a
> good idea. Building something that is exclusively/primarily designed
> around having a web browser XMPP client connected via Jingle WebRTC
> data channels doesn't sound like flexibility was part of the design.

It is not designed around having a web browser at all! Being inspired by a Web API doesn't make that the case; otherwise HTTP upload would be "designed for web browsers" too. Fact is, there has been and still is an enormous amount of engineering going into web technologies, and many good things have emerged from there, like WebRTC, WebSockets, WebAssembly, etc. And again, I already have a non-web implementation (as well as a web one).

Sure, with SSH latency is less of a problem (while still annoying), but the current mechanism works in all cases and is simple and efficient.
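To illustrate what "simple" means here for keyboard input, a minimal sketch of how key presses could map to event payloads. The function name and payload shape are my illustration, not the normative wire format; the "key" and "location" values follow the W3C UI Events KeyboardEvent attributes:

```python
# Illustrative only: builds key event payloads from Web-API-style values.
# "key" is the UI Events key value (e.g. "a", "Enter", "Shift") and
# "location" is the KeyboardEvent.location number (0 standard, 1 left,
# 2 right, 3 numpad). Note there is no platform keycode and no modifier
# bitmask: modifiers are just ordinary keydown/keyup events.

def key_event_payload(event_type, key, location=0):
    if event_type not in ("keydown", "keyup"):
        raise ValueError(f"unexpected event type: {event_type}")
    return {"type": event_type, "key": key, "location": location}

# Typing an uppercase "A" (Shift held) is four plain events:
events = [
    key_event_payload("keydown", "Shift", location=1),
    key_event_payload("keydown", "A"),
    key_event_payload("keyup", "A"),
    key_event_payload("keyup", "Shift", location=1),
]
```

This is why I like the Web API mapping: the controlled device only has to interpret standardized key values, with no per-platform keycode tables.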
Adding complexity with another mechanism just because "there are valid use cases for having input in cases where low latency is not that crucial" doesn't seem worth it.

> Just as you can "directly" map data from JSON objects from a web
> browser to CBOR, you can directly map them to XML. It's not really a
> good idea to do such a direct mapping in both cases though (e.g. if you
> used enumerated keys in CBOR instead of a string map, you can
> drastically reduce the payload size and improve parsing speed).

To have a successful specification, there is a balance to find between efficiency, ease of implementation and flexibility. I believe a string map with selectively mapped data strikes that balance.

> [SNIP]
>
> As I mentioned in another email: If you really feel like using RFB for
> screen content transfer, you can always decide to only use the RFB
> protocol (or something else) for the input part. I took it as an
> example for an existing protocol that (among other features) has logic
> for remote control input.

Again, I'm not hard set on the chosen technologies. I'm not familiar with the internals of RFB and will look at it. If it's a good fit, I'm not against replacing the current events wire format with it. From a quick glance at the Wikipedia page, though, I see: "In terms of transferring clipboard data, there is currently no way to transfer text outside the Latin-1 character set. A common pseudo-encoding extension solves the problem by using UTF-8 in an extended format." This makes me suspicious: one of the design goals of my proposal is to have something really simple and straightforward to implement.

> Using RFB for screen transfer may be an adjacent topic, but not a
> requirement.

The discussed specification focuses on remotely controlling a device rather than on screen/audio transfer. It explains how to use it in conjunction with the current specification for A/V calls for remote desktop, but designing the desktop transfer protocol is out of scope.
Another XEP may be specified if XEP-0167 proves not to be sufficient for desktop transfer, and this proposal will be usable with it without issue. Such a XEP could use RFB, SPICE, or whatever.

> [SNIP]
>
> I just played with
> https://w3c.github.io/uievents/tools/key-event-viewer.html and it's
> still unclear to me when pressing modifier keys, which events are
> emitted when and what is the supposed state of the modifier flag for
> those events. I figured that the behavior is inconsistent between
> browsers (and probably operating systems) and also between different
> keys in the same browser. I bet this is not intended, but as the
> specification and MDN don't really tell me what the correct behavior
> would be, I can't really blame the browsers either.

There is no modifier flag used in the specification: there is the key value and the location number. From my tests, the behavior is consistent and corresponds to the documentation in the browsers that I've tried (Firefox and Chromium).

> I'm not saying there aren't any cases where low-latency is important,
> where I disagree is that this is the case in all occasions. If you
> don't have low latency feedback from the remote device, low latency for
> input is very likely not crucial.

I have the feeling that you only look at this specification from the remote desktop point of view. There are other use cases, and another major one is using a device as input for another one in the same physical location: a smartphone as an ad-hoc touchpad or gamepad, for instance. And since low latency is easily achieved, I still don't see the point of having another mechanism just because in some niche cases latency is less annoying (it still is; it's always annoying).

> Anyway, I remain not convinced that the XSF is the place to specify a
> remote control protocol from scratch (which is what sections 8 and 9 of
> the XEP are about). Mostly because I feel the XSF does not have the
> competence for doing so (aka.
> we will probably do things terribly wrong, due to lack of experience in
> the field).

Again, it is not from scratch: it's reusing existing protocols in a simple, working, easy-to-implement, and efficient way. Thank you for your feedback; as for the rest of your message, I'll take it into account for the next revision if the protoXEP is accepted.

> Instead of `<device type="keyboard"/>` I would go with `<keyboard />`,
> allowing for attributes to be added for more information where there is
> fit (e.g. for a mouse have an optional buttons attribute with the
> number of buttons that are on the mouse, or for a gamepad, you might
> want to provide the layout, etc). This also means that to extend new
> devices outside this specification, one can just have a `<gamepad
> xmlns="urn:xmpp:remote-control:gamepad:0" />` or similar. As a general
> guideline, I feel attributes should only be used if the set of possible
> values is finite.

The specification says that other child elements can be used in <device> for parameters, but your proposition may be cleaner; I'll consider it for a next revision if the protoXEP is accepted. Thanks!

> I would strongly opt to not make the use of data channels a SHOULD in
> this protocol. It really doesn't matter for the purpose of this
> protocol and you don't want to need to upgrade this protocol if a new
> transport protocol becomes available that would be a better fit. Jingle
> does the abstraction to streaming vs datagram, so that application
> protocols don't need to deal with it.

The goal here is to be sure that it will work with web clients, as data channels are currently the only way to have a direct connection with browsers. I can reformulate to only suggest it and get rid of the SHOULD.

> There is a lot of specification for interaction with the Jingle RTP and
> WebRTC protocols. This seems mostly unnecessary.
> - You already write in the requirements that everything should work
> even without Jingle RTP
> - You put that one MUST use the same "WebRTC session" (what is that
> even) for both Jingle RTP and Remote Control. I wouldn't know why this
> is. Of course using existing sessions in Jingle often makes sense
> (that's why it's a feature), but it definitely doesn't need a MUST
> here.

WebRTC has sessions pretty much like Jingle; the session ID is what you have in the o= line of your SDP. The goal here is to reuse the connection and to know which streams are used for what. However, this is not ideal, I agree. I plan to get rid of this section and to work on a separate specification adding metadata to distinguish which streams are used for what.

> - You write explicitly that Remote Control can be added with content-
> add to existing Jingle RTP sessions. This is already given by the
> Jingle specification, which doesn't limit what content can be added to
> a session (e.g. you can also add a file transfer to an existing call).
> - You say that touch devices should not be used when no video RTP
> session is active. I don't see why this shouldn't be possible. I do own
> a drawing tablet that doesn't have a screen but still is an absolute
> pointing device (aka "touch"). If that device was connected via XMPP,
> it wouldn't need a RTP session to transfer its input.

The issue is that the video feed is used in this case to get the screen dimensions. Without it, we can't handle touch events, which use absolute positions (while for the mouse, there is a relative mode for exactly this use case). An alternative would be to specify the screen dimensions when establishing the remote control session.

> - You say that absolute mouse events should not be used when no video
> RTP session is active. I also don't see why this restriction is in
> place - same as above.
>
> For both touch and mouse you use x and y coordinates "relative to the
> video stream". What does that mean?
> x and y are doubles, so are they supposed to be relative to the screen,
> so only values between 0 and 1 (inclusive) are valid?

No, the values are in pixels, the same as in the Web API. They are doubles because pixels can be subdivided (high-DPI displays, transformations). I realize that, besides the link to MDN, this is not explicitly stated; I'll add a note in future revisions. The Web API initially used int and then moved to double. That's the kind of reason why I'm mapping from the Web API: they have already been down that road, and the types are carefully chosen.

> [SNIP]
>
> Wheel events don't have a screen coordinate. I'm pretty sure they
> should have those, as the cursor position for the movement does matter
> a lot.

The cursor position is handled by other devices (mouse or touch). A wheel by itself doesn't have any position (it can be an independent device not linked to a mouse).

> If I understood correctly, you specify that a session is a screen share
> session by adding a remote control content without any device. This
> remote control content would thus effectively not be used, but still
> require setup of a data channel. This doesn't seem like a good
> protocol. The fact that a video is a screen share should be
> communicated outside this specification and this specification should
> not be involved at all in such a case (as it's not a remote control). A
> remote control without devices should be invalid.

This was just to handle the case where no device is accepted; there were two options:

- reject the session entirely;
- treat it as a simple screen share session.

I chose the latter, but indeed the data channel is then useless. I can change it to the other option.

Thanks for the time you took to review the spec and write this feedback. As a summary:

- I'm not hard set on technologies, and I'm OK with getting rid of CBOR if there is consensus on it. I personally still think that it's a superior solution.
- Regarding using RFB for input events only, I'll have a deeper look at the spec and evaluate it. It may be an option if it is comparable in ease of implementation, efficiency and flexibility to the current proposal.
- I will take the other feedback into account for a future revision.

Thanks!

Best,
Goffi
_______________________________________________
Standards mailing list -- standards@xmpp.org
To unsubscribe send an email to standards-leave@xmpp.org
