On Fri, Oct 25, 2019, at 14:25, Marvin W wrote:
> Yes and no. multi-codepoint emojis are still valid characters when
> split, whereas multi-byte codepoints cannot be split. There is
> nothing wrong with displaying the flag 🇪🇺 as 🇪​🇺 *, so your
> implementation is always capable in strictly following any markup
> being done on a codepoint basis, even if the markup border is inside
> a multi-codepoint emoji.

I don't believe that this is always true, but I don't have a good
example off the top of my head; the flag one might be a bad example.
Sometimes splitting between codepoints will not result in two things
that can be displayed. For example, if you split just before a
zero-width joiner, I'm not sure what the behavior should be for a ZWJ
followed by an emoji.
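
A rough Python sketch of the ZWJ case, for illustration only. The
family emoji and the helper name split_at_codepoint are assumptions
picked for the example, not anything from the thread, and the sketch
says nothing about how a renderer would display the second half:

```python
ZWJ = "\u200d"  # ZERO WIDTH JOINER

# WOMAN + ZWJ + WOMAN + ZWJ + GIRL, usually rendered as a single glyph.
family = "\U0001F469\u200d\U0001F469\u200d\U0001F467"


def split_at_codepoint(text: str, index: int) -> tuple[str, str]:
    """Split text at a codepoint index (Python str indexes by codepoint)."""
    return text[:index], text[index:]


# Split just before the first ZWJ: the right half now starts with a
# joiner that has nothing on its left to join to.
left, right = split_at_codepoint(family, 1)
print(len(family), right.startswith(ZWJ))
```

Both halves are still valid Unicode strings, but the second one begins
with a dangling joiner, which is the situation described above.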

> Some programming languages handle strings in unicode codepoints
> instead of bytes.

"Some" being the operative word. We're not writing a protocol to be
easily used in only "some" programming languages.
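
To make the difference concrete, here is a small Python sketch that
measures the same flag three ways. The helper name lengths is made up
for the example; which unit a given language's native string length
reports depends on the language:

```python
# REGIONAL INDICATOR SYMBOL LETTER E + LETTER U (the EU flag)
flag = "\U0001F1EA\U0001F1FA"


def lengths(text: str) -> dict[str, int]:
    """Return the size of text in three common units."""
    return {
        "codepoints": len(text),                           # Python str
        "utf8_bytes": len(text.encode("utf-8")),           # wire format
        "utf16_units": len(text.encode("utf-16-le")) // 2, # UTF-16 strings
    }


print(lengths(flag))
```

An offset that is correct in one unit points somewhere else, or
nowhere valid, in another, so every implementation has to convert to
whatever unit the protocol picks.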

> I agree that this would be an issue for non messaging content (i.e.
> large files) but I don't think we are talking about that. For
> messaging content, it's no issue that the client has to decode all
> the bytes - it will be required to do so anyway for displaying.

This happens at some point, but it doesn't have to happen again at the
application layer. Like I said, it's a minor problem, but it's
definitely more work that I'd prefer not to do.

> Assuming you meant codepoint boundary instead of byte boundary

Indeed.

> I agree that this would also be an option, as long as we make sure
> people actually do these checks. I personally prefer codepoints, but
> both are valid and sane options - as long as we don't go with grapheme
> cluster or any like this, we are fine IMO.

I agree. I thought the answer was grapheme clusters for a while, but
the more I think about it, the more this thread has convinced me that
it's a bad solution.

Currently I'm leaning towards bytes: it's more or less the same as
code points, except it's simpler to implement and verify, and it plays
nicer with low-resource hardware in a trusted environment (where we
might assume messages from certain sources are trusted, skip the
checks, and not want to decode UTF-8 just to figure out where the
boundary should be). It does add an extra error case, an offset that
lands in the middle of a multi-byte codepoint, but it's one that's
obviously a fatal error and means the reference can't be rendered: we
just have to explicitly say that this makes the reference invalid.
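
A minimal sketch of what that check could look like with byte offsets,
in Python. The function names and the begin/end parameters are
hypothetical, not taken from any spec, and the sketch assumes the body
is already valid UTF-8. It relies only on the fact that UTF-8
continuation bytes always have the bit pattern 10xxxxxx:

```python
def is_codepoint_boundary(data: bytes, offset: int) -> bool:
    """True if offset falls between two codepoints of valid UTF-8 data."""
    if offset < 0 or offset > len(data):
        return False
    if offset == len(data):
        return True  # the end of the body is always a boundary
    # A continuation byte (10xxxxxx) means we are inside a codepoint.
    return (data[offset] & 0xC0) != 0x80


def slice_reference(data: bytes, begin: int, end: int) -> str:
    """Return the referenced text, or raise if the reference is invalid."""
    if begin > end or not (
        is_codepoint_boundary(data, begin)
        and is_codepoint_boundary(data, end)
    ):
        raise ValueError("invalid reference: offset splits a codepoint")
    return data[begin:end].decode("utf-8")


body = "a\U0001F1EA\U0001F1FAb".encode("utf-8")  # 1 + 4 + 4 + 1 bytes
print(slice_reference(body, 1, 5) == "\U0001F1EA")
```

The check inspects at most one byte per offset and never decodes the
rest of the body, so a trusted, constrained receiver could skip it
entirely and slice the bytes directly.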

—Sam

-- 
Sam Whited
_______________________________________________
Standards mailing list
Info: https://mail.jabber.org/mailman/listinfo/standards
Unsubscribe: standards-unsubscr...@xmpp.org
_______________________________________________
