> On 13 Feb 2018, at 10:55, isis agora lovecruft <i...@torproject.org> wrote:
> 
> A couple outcomes of this:
> 
> 1. What passes for "canonicalised" "utf-8" in C will be different to
>    what passes for "canonicalised" "utf-8" in Rust.  In C, the
>    following will not be allowed (whereas they are allowed in Rust):
>        - NUL (0x00)
>        - Byte Order Mark (0xFEFF)

I want to clarify this point:

The Byte Order Mark is the Unicode scalar value U+FEFF, encoded in
UTF-8 as the three bytes 0xEF 0xBB 0xBF.
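
As a quick check of that encoding, here is a minimal Rust sketch. The
function name bom_utf8_bytes is made up for illustration and is not
part of Tor; it only uses the standard library's char::encode_utf8.

```rust
/// Hypothetical helper (not part of Tor): returns the UTF-8 encoding
/// of the Byte Order Mark, U+FEFF.
pub fn bom_utf8_bytes() -> Vec<u8> {
    // A char needs at most 4 bytes in UTF-8.
    let mut buf = [0u8; 4];
    // encode_utf8 writes into buf and returns the encoded prefix as a &str.
    '\u{FEFF}'.encode_utf8(&mut buf).as_bytes().to_vec()
}
```

The result is three bytes long, which is why a check on the single
value 0xFEFF has to be done on decoded characters, not on raw bytes.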

Tor's C and Rust implementations of UTF-8 must be identical: they
must accept and reject exactly the same byte sequences.

When we write the C implementation, we must reject NUL for
compatibility with C string functions.

When we write the Rust implementation, we must reject NUL for
compatibility with the C implementation. (Rust's built-in UTF-8
strings accept NUL, so this will require custom code.)

When we write the C and Rust implementations, we must reject BOM
because it's unnecessary. Rejecting BOM is recommended by the
relevant standard. (Rust's built-in UTF-8 strings accept BOM, so
this will require custom code.)
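
To make the "custom code" concrete, here is a rough Rust sketch of
what such a check could look like. It is an illustration under
assumptions, not the actual Tor implementation: the names
TorStrError and validate_tor_utf8 are hypothetical, and I am assuming
that U+FEFF is rejected anywhere in the string, not only at the
start.

```rust
/// Hypothetical error type for this sketch (not an existing Tor type).
#[derive(Debug, PartialEq, Eq)]
pub enum TorStrError {
    /// The bytes are not well-formed UTF-8.
    InvalidUtf8,
    /// The string contains NUL (U+0000), which C string functions
    /// would treat as a terminator.
    ContainsNul,
    /// The string contains a Byte Order Mark (U+FEFF).
    ContainsBom,
}

/// Hypothetical validator: accepts well-formed UTF-8 that contains
/// neither NUL nor U+FEFF, and returns the input viewed as a &str.
pub fn validate_tor_utf8(bytes: &[u8]) -> Result<&str, TorStrError> {
    // The standard library check accepts both NUL and BOM,
    // so the two extra checks below are needed.
    let s = std::str::from_utf8(bytes).map_err(|_| TorStrError::InvalidUtf8)?;
    if s.contains('\0') {
        return Err(TorStrError::ContainsNul);
    }
    // Assumption: reject U+FEFF at any position, not just as a prefix.
    if s.contains('\u{FEFF}') {
        return Err(TorStrError::ContainsBom);
    }
    Ok(s)
}
```

A C implementation would need to make the same three decisions in
the same way, so that both sides agree on every input.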

T
_______________________________________________
tor-dev mailing list
tor-dev@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev
