On Fri, Oct 25, 2019, at 14:25, Marvin W wrote:
> Yes and no. multi-codepoint emojis are still valid characters when
> split, whereas multi-byte codepoints cannot be split. There is
> nothing wrong with displaying the flag 🇪🇺 as 🇪🇺 *, so your
> implementation is always capable in strictly following any markup
> being done on a codepoint basis, even if the markup border is inside
> a multi-codepoint emoji.
I don't believe that this is always true, but I don't have a good
example off the top of my head; the flag one might be a bad example.
Sometimes splitting codepoints will not result in two things that can
be displayed. For example, if you split just before a zero-width
joiner, I'm not sure what the behavior should be for a ZWJ followed by
an emoji.

> Some programming languages handle strings in unicode codepoints
> instead of bytes.

"Some" being the operative word. We're not writing a protocol to be
easily used in only "some" programming languages.

> I agree that this would be an issue for non messaging content (i.e.
> large files) but I don't think we are talking about. For messaging
> content, it's no issue that the client has two decode all the bytes -
> it will be required to do so anyway for displaying.

This happens at some point, but it doesn't have to happen again at the
application layer. Like I said, it's a minor problem, but it's
definitely more work that I'd prefer not to do.

> Assuming you meant codepoint boundary instead of byte boundary

Indeed.

> I agree that this would also be an option, as long as we make sure
> people actually do these checks. I personally prefer codepoints, but
> both are valid and sane options - as long as we don't go with grapheme
> cluster or any like this, we are fine IMO.

I agree. I thought the answer was grapheme clusters for a while, but
the more I think about it, the more this thread has convinced me that
it's a bad solution. Currently I'm leaning towards bytes: it's more or
less the same as code points, except it's simpler to implement and
verify, and it plays nicer with low-resource hardware in a trusted
environment (where we might not care about doing any checks and assume
messages from certain sources are trusted, so we don't want to have to
decode UTF-8 to figure out where the boundary should be).
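A minimal sketch of what the byte-based check could look like, in
Python. The function names are hypothetical and not taken from any XEP
or library, and the sketch assumes the message body has already been
validated as UTF-8 (as it would be after XML parsing). Under that
assumption, an offset is on a code point boundary exactly when the byte
at that offset is not a continuation byte (0b10xxxxxx), so nothing has
to be decoded to find out.

```python
def is_codepoint_boundary(data: bytes, offset: int) -> bool:
    """Return True if offset falls between two code points.

    Assumes data is already known to be valid UTF-8.
    """
    if offset < 0 or offset > len(data):
        return False
    if offset == len(data):
        return True  # the end of the body is always a boundary
    # UTF-8 continuation bytes have the bit pattern 0b10xxxxxx.
    return (data[offset] & 0xC0) != 0x80


def slice_reference(data: bytes, begin: int, end: int) -> str:
    """Return the referenced text, or raise ValueError if the byte
    range does not start and end on code point boundaries."""
    if begin > end:
        raise ValueError("invalid reference: begin is after end")
    if not (is_codepoint_boundary(data, begin)
            and is_codepoint_boundary(data, end)):
        raise ValueError("invalid reference: offset inside a code point")
    return data[begin:end].decode("utf-8")


# The EU flag is two regional indicators of four bytes each.
body = "flag \U0001F1EA\U0001F1FA ok".encode("utf-8")
print(is_codepoint_boundary(body, 9))  # between the two indicators
print(is_codepoint_boundary(body, 7))  # inside the first indicator
```

Each check reads a single byte per offset, so a trusted, low-resource
receiver could skip it entirely and an untrusted one pays very little.
Offset 9 is accepted even though it splits the flag, which matches the
code point behavior discussed above.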
Byte offsets do add an extra error case (an offset that lands in the
middle of a multi-byte code point), but it's one that's obviously a
fatal error and means the reference can't be rendered: we just have to
explicitly say that this makes the reference invalid.

—Sam

-- 
Sam Whited
_______________________________________________
Standards mailing list
Info: https://mail.jabber.org/mailman/listinfo/standards
Unsubscribe: standards-unsubscr...@xmpp.org
_______________________________________________