Re: [tor-dev] [prop-meeting] [prop#285] "Directory documents should be standardized as UTF-8"

Iain Learmonth Tue, 13 Feb 2018 02:56:08 -0800

Hi,

On 12/02/18 23:55, isis agora lovecruft wrote:
>  1. What passes for "canonicalised" "utf-8" in C will be different to
>     what passes for "canonicalised" "utf-8" in Rust.  In C, the
>     following will not be allowed (whereas they are allowed in Rust):
>         - NUL (0x00)
>         - Byte Order Mark (0xFEFF)


Much of the metrics software is written in Java. Java strings allow
NUL to appear, but assume that there is no BOM. If a BOM appears, it
would be interpreted as data and, I assume, parsing would probably
fail. Should the whole document be rejected if it contains a NUL or BOM,
or should these values be stripped and parsing carry on as if they had
never been there?
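
To make the two options concrete, here is a rough sketch in Python (chosen
only for brevity). The names decode_document and DocumentRejected are
hypothetical and do not come from the proposal; the sketch assumes the
document arrives as raw bytes and that the check applies to a BOM or NUL
anywhere in the document, not just at the start.

```python
BOM = "\ufeff"  # U+FEFF, encoded in UTF-8 as EF BB BF
NUL = "\x00"


class DocumentRejected(ValueError):
    """Hypothetical error for a document containing a forbidden code point."""


def decode_document(raw: bytes, strip: bool = False) -> str:
    # Strict decoding: malformed UTF-8 raises UnicodeDecodeError.
    text = raw.decode("utf-8")
    if strip:
        # Option 2: drop the offending code points and carry on.
        return text.replace(BOM, "").replace(NUL, "")
    # Option 1: reject the whole document.
    for ch in (NUL, BOM):
        if ch in text:
            raise DocumentRejected(f"forbidden code point U+{ord(ch):04X}")
    return text
```

Whichever branch the proposal picks, every parser would need to take the
same one, otherwise the same bytes could be accepted by one tool and
rejected by another.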

>  2. Directory document keywords MUST be printable ASCII.

This can be validated. Should a single document keyword containing
printable non-ASCII characters be enough to reject the document, or
should a parser try to recover?
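
A minimal sketch of such a validation, under stated assumptions: that
"printable ASCII" means 0x21 to 0x7E, that the keyword is the first
space-delimited token on each line, and that lines are separated by a
single newline. The function names are hypothetical. It only reports the
offending keywords; what to do with them is the open question above.

```python
def keyword_is_printable_ascii(keyword: str) -> bool:
    # Assumption: printable ASCII excludes space, i.e. 0x21..0x7E.
    return len(keyword) > 0 and all(0x21 <= ord(c) <= 0x7E for c in keyword)


def invalid_keywords(document: str) -> list:
    bad = []
    # Split on "\n" only; str.splitlines() would also split on other
    # Unicode line separators.
    for line in document.split("\n"):
        if not line:
            continue
        keyword = line.split(" ", 1)[0]
        if not keyword_is_printable_ascii(keyword):
            bad.append(keyword)
    return bad
```

A strict parser would reject the document when the returned list is
non-empty; a lenient one might skip just those lines.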

I'd really like to see a section in the proposal about how parsers
should react when they find something unexpected, otherwise all the
parsers may end up doing different things.

>  3. This change may break some descriptor/consensus/document parsers.
>     If you are the maintainer of a parser, you may want to start
>     thinking about this now.

For the metrics tools there are some guidelines on this we can follow:
https://docs.oracle.com/javase/tutorial/i18n/text/design.html

The other language would be Python (for Stem), but Python developers
have probably got a good understanding of Unicode/str/bytes by now. (In
Python 3, when decoding as UTF-8, a BOM is not stripped and is
interpreted as data, and a str can contain a NUL.)
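
The Python 3 behaviour can be checked with a few lines. The sample bytes
are made up for illustration; "utf-8-sig" is the standard codec that
strips a single leading BOM, shown here for contrast.

```python
raw = b"\xef\xbb\xbfrouter\x00 foo"  # UTF-8 BOM, then text with a NUL

as_utf8 = raw.decode("utf-8")          # BOM is kept as U+FEFF
as_utf8_sig = raw.decode("utf-8-sig")  # leading BOM is stripped

assert as_utf8[0] == "\ufeff"
assert as_utf8_sig.startswith("router")
assert "\x00" in as_utf8  # a str may contain NUL
```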

Thanks,
Iain.


