Re: [tor-dev] [prop-meeting] [prop#285] "Directory documents should be standardized as UTF-8"

2018-02-13 Thread teor


> On 14 Feb 2018, at 11:03, Damian Johnson  wrote:
> 
>> For the metrics tools there are some guidelines on this we can follow:
>> https://docs.oracle.com/javase/tutorial/i18n/text/design.html. The other
>> language would be Python (for stem), but Python developers have probably
>> got a good understanding of unicode/str/bytes by now. (In Python 3: when
>> using UTF-8, BOM will not be stripped and will be interpreted as data,
>> and you can have a NUL in a str).
> 
> Hi Iain. Actually, for Stem I'm really looking forward to this too.
> Stem has special handling for the contact and platform fields (iirc
> the only spot non-ascii content can presently appear). Stem's parsers
> and API will be simplified once everything is uniformly utf-8. :P
> 
> Possibly a stupid question but any reason not to require the whole
> descriptor document to be printable characters?

Requiring printable ASCII throughout the document means that people
can't spell their names and email addresses correctly in contact lines.

Requiring printable unicode introduces a dependency on a particular
unicode version, because we don't know if unallocated blocks will be
printable or not.
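
To illustrate the version dependency, here is a minimal Python sketch
(Python is used only because Stem comes up in this thread; the helper
name describe is hypothetical). Python's str.isprintable() answers from
the Unicode database bundled with the interpreter, so the result for a
code point that is unassigned today could change once a later Unicode
version allocates it.

```python
import unicodedata


def describe(ch):
    """Return (general category, printable?) for one character,
    according to the Unicode database bundled with this interpreter."""
    return unicodedata.category(ch), ch.isprintable()


if __name__ == "__main__":
    # The bundled database version differs between Python releases.
    print("Unicode database version:", unicodedata.unidata_version)
    for ch in ("A", "\u00e9", "\x00", "\u0378"):
        # U+0378 is picked as an example of a code point that is
        # unassigned in current Unicode versions (an assumption worth
        # re-checking against the database you actually run with).
        print("U+%04X" % ord(ch), describe(ch))
```

Two parsers built against different Unicode versions could therefore
disagree about the same descriptor, which is the problem with a
"printable unicode" rule.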

I think we could make platform lines printable ASCII without losing
much. Unless there are platforms that have non-ASCII names?

T

--
Tim Wilson-Brown (teor)

teor2345 at gmail dot com
PGP C855 6CED 5D90 A0C5 29F6 4D43 450C BA7F 968F 094B
ricochet:ekmygaiu4rzgsk6n







___
tor-dev mailing list
tor-dev@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev


Re: [tor-dev] [prop-meeting] [prop#285] "Directory documents should be standardized as UTF-8"

2018-02-13 Thread Damian Johnson
> For the metrics tools there are some guidelines on this we can follow:
> https://docs.oracle.com/javase/tutorial/i18n/text/design.html. The other
> language would be Python (for stem), but Python developers have probably
> got a good understanding of unicode/str/bytes by now. (In Python 3: when
> using UTF-8, BOM will not be stripped and will be interpreted as data,
> and you can have a NUL in a str).

Hi Iain. Actually, for Stem I'm really looking forward to this too.
Stem has special handling for the contact and platform fields (iirc
the only spot non-ascii content can presently appear). Stem's parsers
and API will be simplified once everything is uniformly utf-8. :P

Possibly a stupid question but any reason not to require the whole
descriptor document to be printable characters?


Re: [tor-dev] [prop-meeting] [prop#285] "Directory documents should be standardized as UTF-8"

2018-02-13 Thread teor

> On 13 Feb 2018, at 21:55, Iain Learmonth  wrote:
> 
> Hi,
> 
>> On 12/02/18 23:55, isis agora lovecruft wrote:
>> 1. What passes for "canonicalised" "utf-8" in C will be different to
>>    what passes for "canonicalised" "utf-8" in Rust.  In C, the
>>    following will not be allowed (whereas they are allowed in Rust):
>>    - NUL (0x00)
>>    - Byte Order Mark (0xFEFF)
> 
> Much of the metrics software is written in Java. Java strings allow for
> NUL to appear, but assume that there is no BOM. If a BOM appears, then
> this would be interpreted as data and, I assume, parsing would probably
> fail. Should the whole document be rejected if it contains a NUL or BOM,
> or should these values be stripped and then carry on parsing as if it
> never happened?

Directory authorities and bridge clients already reject descriptors that
contain NUL. (This is an artefact of the C implementation: the descriptor
is seen as truncated, so it won't parse.)

We should specify rejection for BOM as well.
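
The truncation effect can be imitated from Python with ctypes, as a
rough sketch (this is not Tor's parser; the helper name as_c_string and
the sample descriptor text are made up for illustration). A C string
ends at the first NUL, so code using C string functions never sees the
rest of the document.

```python
import ctypes


def as_c_string(raw):
    """Return the bytes a C string function would see: everything
    before the first NUL byte."""
    buf = ctypes.create_string_buffer(raw)
    return buf.value  # .value stops at the first NUL; .raw would not


# Hypothetical descriptor fragment with an embedded NUL.
doc = b"router example 192.0.2.1 9001 0 0\ncontact a\x00b\nrouter-signature\n"
seen = as_c_string(doc)
```

Here seen ends inside the contact line, so the router-signature line
is missing from what the parser gets and the descriptor cannot parse.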

>> 2. Directory document keywords MUST be printable ASCII.
> 
> This can be validated. Should a single document keyword containing
> printable non-ASCII be enough to reject the document, or should a parser
> try to recover?

If parsers want to be consistent with the Tor implementation, they should
reject.
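
A reject-on-first-error check could look roughly like this in Python.
This is a sketch under assumptions, not the Tor implementation: the
function names are hypothetical, the keyword is taken to be the first
space-separated token of each line, and "printable ASCII" is taken to
mean bytes 0x21 to 0x7E.

```python
def keyword_of(line):
    """Return the keyword (first space-separated token) of a line."""
    return line.split(b" ", 1)[0]


def is_printable_ascii(token):
    """True if token is non-empty and every byte is in 0x21..0x7E."""
    return len(token) > 0 and all(0x21 <= b <= 0x7E for b in token)


def check_document(raw):
    """Raise ValueError on the first line whose keyword is not printable
    ASCII, rather than trying to recover."""
    for number, line in enumerate(raw.split(b"\n"), start=1):
        if not line:
            continue  # skip empty lines, e.g. after a trailing newline
        if not is_printable_ascii(keyword_of(line)):
            raise ValueError("line %d: keyword is not printable ASCII" % number)
    return True
```

Non-ASCII bytes in the arguments (for example a contact line) pass this
check; only the keyword position is restricted.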

> I'd really like to see a section in the proposal about how parsers
> should react when they find something unexpected, otherwise all the
> parsers may end up doing different things.

+1

>> 3. This change may break some descriptor/consensus/document parsers.
>>    If you are the maintainer of a parser, you may want to start
>>    thinking about this now.
> 
> For the metrics tools there are some guidelines on this we can follow:
> https://docs.oracle.com/javase/tutorial/i18n/text/design.html. The other
> language would be Python (for stem), but Python developers have probably
> got a good understanding of unicode/str/bytes by now. (In Python 3: when
> using UTF-8, BOM will not be stripped and will be interpreted as data,
> and you can have a NUL in a str).

Python for txtorcon
Rust for Tor's experimental protover implementation

And perhaps others:
https://stem.torproject.org/faq.html#are-there-any-other-controller-libraries
https://trac.torproject.org/projects/tor/wiki/doc/ListOfTorImplementations

T


Re: [tor-dev] [prop-meeting] [prop#285] "Directory documents should be standardized as UTF-8"

2018-02-13 Thread Iain Learmonth
Hi,

On 12/02/18 23:55, isis agora lovecruft wrote:
>  1. What passes for "canonicalised" "utf-8" in C will be different to
>     what passes for "canonicalised" "utf-8" in Rust.  In C, the
>     following will not be allowed (whereas they are allowed in Rust):
>     - NUL (0x00)
>     - Byte Order Mark (0xFEFF)

Much of the metrics software is written in Java. Java strings allow for
NUL to appear, but assume that there is no BOM. If a BOM appears, then
this would be interpreted as data and, I assume, parsing would probably
fail. Should the whole document be rejected if it contains a NUL or BOM,
or should these values be stripped and then carry on parsing as if it
never happened?

>  2. Directory document keywords MUST be printable ASCII.

This can be validated. Should a single document keyword containing
printable non-ASCII be enough to reject the document, or should a parser
try to recover?

I'd really like to see a section in the proposal about how parsers
should react when they find something unexpected, otherwise all the
parsers may end up doing different things.

>  3. This change may break some descriptor/consensus/document parsers.
>     If you are the maintainer of a parser, you may want to start
>     thinking about this now.

For the metrics tools there are some guidelines on this we can follow:
https://docs.oracle.com/javase/tutorial/i18n/text/design.html. The other
language would be Python (for stem), but Python developers have probably
got a good understanding of unicode/str/bytes by now. (In Python 3: when
using UTF-8, BOM will not be stripped and will be interpreted as data,
and you can have a NUL in a str).
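
The Python 3 behaviour described above can be checked with a few lines
(a small sketch; the sample bytes are made up). The plain "utf-8" codec
keeps a leading BOM as the character U+FEFF, while the separate
"utf-8-sig" codec strips it; NUL survives either way.

```python
# Sample input: UTF-8 BOM, some text, and an embedded NUL.
BOM_AND_NUL = b"\xef\xbb\xbfcontact a\x00b"

# Plain "utf-8": the BOM is decoded as data (U+FEFF is the first
# character) and NUL is an ordinary character inside the str.
plain = BOM_AND_NUL.decode("utf-8")

# "utf-8-sig": a leading BOM is dropped, NUL is still kept.
sig = BOM_AND_NUL.decode("utf-8-sig")
```

So a Python parser that wants to reject BOM or NUL has to check for
them explicitly after decoding.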

Thanks,
Iain.





Re: [tor-dev] [prop-meeting] [prop#285] "Directory documents should be standardized as UTF-8"

2018-02-12 Thread teor

> On 13 Feb 2018, at 10:55, isis agora lovecruft  wrote:
> 
> A couple outcomes of this:
> 
> 1. What passes for "canonicalised" "utf-8" in C will be different to
>    what passes for "canonicalised" "utf-8" in Rust.  In C, the
>    following will not be allowed (whereas they are allowed in Rust):
>    - NUL (0x00)
>    - Byte Order Mark (0xFEFF)

I want to clarify this point:

The Byte Order Mark is the Unicode scalar value U+FEFF, encoded in
UTF-8 as the bytes 0xEF 0xBB 0xBF.

Tor's C and Rust implementations of UTF-8 must be identical.

When we write the C implementation, we must reject NUL for
compatibility with C string functions.

When we write the Rust implementation, we must reject NUL for
compatibility with the C implementation. (Rust already implements
UTF-8 strings that accept NUL, so this will require custom code).

When we write the C and Rust implementations, we must reject BOM
because it's unnecessary. Rejecting BOM is recommended by the
relevant standard. (Rust already implements UTF-8 strings that accept
BOM, so this will require custom code).
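
The same rules, written as a Python sketch for parser maintainers (not
the C or Rust code, which does not exist yet; validate_document is a
hypothetical name, and rejecting U+FEFF anywhere in the document rather
than only at the start is an assumption on my part):

```python
import codecs

BOM = "\ufeff"  # U+FEFF; codecs.BOM_UTF8 is its UTF-8 encoding, EF BB BF


def validate_document(raw):
    """Decode raw bytes as strict UTF-8 and reject NUL and BOM.

    Returns the decoded str, or raises ValueError."""
    try:
        text = raw.decode("utf-8")  # strict mode: invalid sequences raise
    except UnicodeDecodeError as e:
        raise ValueError("not valid UTF-8: %s" % e)
    if "\x00" in text:
        raise ValueError("document contains NUL")
    if BOM in text:
        raise ValueError("document contains a byte order mark")
    return text
```

Like Rust's str, Python's str accepts both NUL and BOM, which is why
the explicit checks are needed on top of the decoder.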

T


Re: [tor-dev] [prop-meeting] [prop#285] "Directory documents should be standardized as UTF-8"

2018-02-12 Thread isis agora lovecruft
Hi!

The notes from this meeting are online. [0] Thanks to everyone who
attended!  Extra thanks to teor for conducting the meeting since I was
stupidly 8 minutes late due to impatiently watching a kettle boil
after eating very spicy cioppino and then *extremely* needing a glass
of iced tea immediately.

We found some issues w.r.t. the specifics of the proposal, but overall
we've agreed that it should be accepted (roughly, after some minor
revision) in its current state.  As such, it is looking for someone
interested in implementing it!  (THIS COULD BE YOU)

A couple outcomes of this:

 1. What passes for "canonicalised" "utf-8" in C will be different to
    what passes for "canonicalised" "utf-8" in Rust.  In C, the
    following will not be allowed (whereas they are allowed in Rust):
    - NUL (0x00)
    - Byte Order Mark (0xFEFF)

 2. Directory document keywords MUST be printable ASCII.

 3. This change may break some descriptor/consensus/document parsers.
    If you are the maintainer of a parser, you may want to start
    thinking about this now.

[0]: 
http://meetbot.debian.net/tor-meeting/2018/tor-meeting.2018-02-12-21.04.html

Best regards,
-- 
 ♥Ⓐ isis agora lovecruft
_
OpenPGP: 4096R/0A6A58A14B5946ABDE18E207A3ADB67A2CDB8B35
Current Keys: https://fyb.patternsinthevoid.net/isis.txt




Re: [tor-dev] [prop-meeting] [prop#285] "Directory documents should be standardized as UTF-8"

2018-02-05 Thread isis agora lovecruft
This one is in #tor-meeting, next Monday, 12 February from 21:00-22:00 UTC.
In local times:

 * Monday, 12 February 13:00-14:00 PST
 * Monday, 12 February 16:00-17:00 EST
 * Monday, 12 February 22:00-23:00 CET
 * Tuesday, 13 February 08:00-09:00 AEDT

isis agora lovecruft transcribed 2.3K bytes:
> Reminder to please vote for a time for this if you'd still like to attend!
> 
> isis agora lovecruft transcribed 2.2K bytes:
> > Hello,
> > 
> > Let's schedule a proposal discussion for prop#285 "Directory documents
> > should be standardized as UTF-8" [0] sometime between 12 - 13 Feb.  If
> > you're CCed, it's because you put your name down on the pad as being
> > interested in this discussion.  If anyone has requests or concerns, or if I
> > forgot to take your timezone into account, please let me know.
> > 
> > https://doodle.com/poll/cnc6scybbfpky5f8
> > 
> > [0]: https://gitweb.torproject.org/torspec.git/tree/proposals/285-utf-8.txt

-- 
 ♥Ⓐ isis agora lovecruft
_
OpenPGP: 4096R/0A6A58A14B5946ABDE18E207A3ADB67A2CDB8B35
Current Keys: https://fyb.patternsinthevoid.net/isis.txt




Re: [tor-dev] [prop-meeting] [prop#285] "Directory documents should be standardized as UTF-8"

2018-02-05 Thread isis agora lovecruft
Reminder to please vote for a time for this if you'd still like to attend!

isis agora lovecruft transcribed 2.2K bytes:
> Hello,
> 
> Let's schedule a proposal discussion for prop#285 "Directory documents
> should be standardized as UTF-8" [0] sometime between 12 - 13 Feb.  If
> you're CCed, it's because you put your name down on the pad as being
> interested in this discussion.  If anyone has requests or concerns, or if I
> forgot to take your timezone into account, please let me know.
> 
> https://doodle.com/poll/cnc6scybbfpky5f8
> 
> [0]: https://gitweb.torproject.org/torspec.git/tree/proposals/285-utf-8.txt

Best regards,
-- 
 ♥Ⓐ isis agora lovecruft
_
OpenPGP: 4096R/0A6A58A14B5946ABDE18E207A3ADB67A2CDB8B35
Current Keys: https://fyb.patternsinthevoid.net/isis.txt




[tor-dev] [prop-meeting] [prop#285] "Directory documents should be standardized as UTF-8" (was: Nominate/vote for future proposal discussion meetings!)

2018-01-29 Thread isis agora lovecruft
Hello,

Let's schedule a proposal discussion for prop#285 "Directory documents
should be standardized as UTF-8" [0] sometime between 12 - 13 Feb.  If
you're CCed, it's because you put your name down on the pad as being
interested in this discussion.  If anyone has requests or concerns, or if I
forgot to take your timezone into account, please let me know.

https://doodle.com/poll/cnc6scybbfpky5f8

[0]: https://gitweb.torproject.org/torspec.git/tree/proposals/285-utf-8.txt

Best regards,
-- 
 ♥Ⓐ isis agora lovecruft
_
OpenPGP: 4096R/0A6A58A14B5946ABDE18E207A3ADB67A2CDB8B35
Current Keys: https://fyb.patternsinthevoid.net/isis.txt

