On 12/02/18 23:55, isis agora lovecruft wrote:
>  1. What passes for "canonicalised" "utf-8" in C will be different to
>     what passes for "canonicalised" "utf-8" in Rust.  In C, the
>     following will not be allowed (whereas they are allowed in Rust):
>         - NUL (0x00)
>         - Byte Order Mark (0xFEFF)

Much of the metrics software is written in Java. Java strings may
contain NUL, but Java assumes that there is no BOM. If a BOM appears, it
would be interpreted as data and, I assume, parsing would probably fail.
If a document contains a NUL or BOM, should the whole document be
rejected, or should these characters be stripped and parsing continue as
if they had never been there?
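
To make the two options concrete, here is a minimal sketch in Python
(not the metrics Java code). The names decode_strict, decode_lenient and
DocumentRejected are hypothetical and do not come from the proposal or
from any existing parser:

```python
BOM = "\ufeff"
NUL = "\x00"


class DocumentRejected(ValueError):
    """Raised by the strict policy when a forbidden character is found."""


def decode_strict(raw: bytes) -> str:
    """Option 1: reject the whole document if it contains a NUL or BOM."""
    text = raw.decode("utf-8")  # raises UnicodeDecodeError on invalid UTF-8
    for ch in (NUL, BOM):
        pos = text.find(ch)
        if pos != -1:
            raise DocumentRejected(
                f"forbidden U+{ord(ch):04X} at index {pos}")
    return text


def decode_lenient(raw: bytes) -> str:
    """Option 2: strip every NUL and BOM, then carry on parsing."""
    text = raw.decode("utf-8")
    return text.replace(NUL, "").replace(BOM, "")
```

The lenient variant silently changes the bytes that were received, so
two parsers that pick different options would disagree about the same
document.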

>  2. Directory document keywords MUST be printable ASCII.

This can be validated. Should a single document keyword containing
printable non-ASCII be enough to reject the document, or should a parser
try to recover?
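
As a rough sketch of the validation step itself, assuming that
"printable ASCII" means the characters 0x21 to 0x7E and that the keyword
is the first space-separated token on each line (both are my
assumptions, as are the function names):

```python
from typing import List


def is_printable_ascii(keyword: str) -> bool:
    # Assumption: printable ASCII is 0x21..0x7E (no space, no controls).
    return bool(keyword) and all(0x21 <= ord(c) <= 0x7E for c in keyword)


def bad_keyword_lines(document: str) -> List[int]:
    """Return 1-based numbers of lines whose first token fails the check.

    Simplification: every non-empty line is treated as a keyword line.
    """
    bad = []
    for number, line in enumerate(document.split("\n"), start=1):
        if not line:
            continue
        keyword = line.split(" ", 1)[0]
        if not is_printable_ascii(keyword):
            bad.append(number)
    return bad
```

Returning the offending line numbers leaves the policy to the caller: it
can reject the document if the list is non-empty, or skip those lines
and try to recover.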

I'd really like to see a section in the proposal about how parsers
should react when they find something unexpected; otherwise all the
parsers may end up doing different things.

>  3. This change may break some descriptor/consensus/document parsers.
>     If you are the maintainer of a parser, you may want to start
>     thinking about this now.

For the metrics tools there are some guidelines on this we can follow:
https://docs.oracle.com/javase/tutorial/i18n/text/design.html. The other
language would be Python (for stem), but Python developers have probably
got a good understanding of unicode/str/bytes by now. (In Python 3, the
"utf-8" codec does not strip a BOM but interprets it as data, and a str
can contain a NUL.)
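
The Python 3 behaviour can be checked with a few lines. The sample bytes
are made up for illustration. The standard "utf-8-sig" codec is shown
only for comparison: it drops one leading BOM, which is a separate
choice from what the proposal asks for.

```python
# UTF-8 encoded BOM (EF BB BF), then text containing a NUL.
raw = b"\xef\xbb\xbfrouter\x00 example\n"

plain = raw.decode("utf-8")      # BOM is kept as U+FEFF, NUL is kept
sig = raw.decode("utf-8-sig")    # leading BOM is removed, NUL is kept

assert plain.startswith("\ufeff")
assert "\x00" in plain
assert sig == "router\x00 example\n"
```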

