Re: New character encoding conversion API

2017-06-16 Thread Henri Sivonen
On Thu, Jun 15, 2017 at 3:58 PM, Nathan Froyd wrote:
> Can you file a bug so `mach vendor rust` complains about vendoring
> rust-encoding?

Filed https://bugzilla.mozilla.org/show_bug.cgi?id=1373554

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/
___
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform


Re: New character encoding conversion API

2017-06-15 Thread Simon Sapin

On 15/06/2017 12:32, Henri Sivonen wrote:

  * We don't have third-party crates in m-c that (unconditionally)
require rust-encoding. However, if you need to import such a crate and
it's infeasible to make it use encoding_rs directly, please do not
vendor rust-encoding into the tree. Vendoring rust-encoding into the
tree would bring in another set of lookup tables, which encoding_rs is
specifically trying to avoid. I have a compatibility shim ready in case
the need to vendor rust-encoding-dependent crates arises.
https://github.com/hsivonen/encoding_rs_compat


I’ve recently made the rust-encoding dependency in Tendril optional and 
disabled by default. (Tendril is the refcounted buffer type used in 
html5ever, Servo’s HTML parser.) So this won’t be an issue if we ever 
want to use html5ever in m-c.


I’ve also removed the "parse from bytes" (as opposed to 
parse-from-Unicode) API from html5ever because it couldn’t support 
interrupting parsing on <meta> correctly. When adding something like it 
again, we’ll do it using encoding_rs.


--
Simon Sapin


Re: New character encoding conversion API

2017-06-15 Thread Nathan Froyd
On Thu, Jun 15, 2017 at 6:32 AM, Henri Sivonen wrote:
> encoding_rs landed delivering correctness, safety, performance and
> code size benefits as well as new functionality.

Thanks for working on this.

>  * We don't have third-party crates in m-c that (unconditionally)
> require rust-encoding. However, if you need to import such a crate and
> it's infeasible to make it use encoding_rs directly, please do not
> vendor rust-encoding into the tree. Vendoring rust-encoding into the
> tree would bring in another set of lookup tables, which encoding_rs is
> specifically trying to avoid.

Can you file a bug so `mach vendor rust` complains about vendoring
rust-encoding?

Thanks,
-Nathan


New character encoding conversion API

2017-06-15 Thread Henri Sivonen
encoding_rs landed, delivering correctness, safety, performance, and
code size benefits as well as new functionality. Here's a summary of
the need-to-know points from the perspective of using it.

The docs for the Rust-visible API are at: https://docs.rs/encoding_rs/
The docs for the C++-visible API are at:
https://searchfox.org/mozilla-central/source/intl/Encoding.h#100

The docs for the Rust-visible API also explain some design decisions.
The docs also say how the API maps to the concepts of the Encoding
Standard.

 * We now have the capability of decoding external text directly into
UTF-8 and encoding text directly from UTF-8. This is a genuine
direct-to-UTF-8 capability that does not pivot through UTF-16 buffers.
If you're writing new code that takes textual input from external
sources, please make your code operate on UTF-8 internally instead of
making it operate on UTF-16. This way, the common decode case becomes
mere validation and the parser-sensitive syntax (ASCII in Web formats)
takes half the space.

 * nsIUnicodeDecoder and nsIUnicodeEncoder no longer exist and have
been replaced with mozilla::Decoder and mozilla::Encoder,
respectively. (encoding_rs::Decoder and encoding_rs::Encoder in Rust.)

 * The above two types only need to be used for streaming conversions.
You no longer need to implement non-streaming conversions yourself on
top of the streaming converters. Instead, mozilla::Encoding (C++; both
nsAString and nsACString overloads for UTF-16 and UTF-8, respectively)
and encoding_rs::Encoding (Rust; UTF-8 only) provide methods for
non-streaming conversions and these methods take care of avoiding
copies when possible. (If you need to work with XPCOM strings from
Rust, there are functions in the encoding_glue crate. They haven't
been grouped into a trait but could be if deemed necessary/useful.)

 * There is now a type-safe representation for the concept of an
encoding: const mozilla::Encoding* in C++ and &'static
encoding_rs::Encoding in Rust.

   - The two are toll-free bridged: they are the same thing. When
crossing the FFI, write const mozilla::Encoding* on the C++ side and
*const encoding_rs::Encoding on the Rust side.

   - The referents are statically allocated, so there's no need to
refcount and using the plain pointers is really OK in C++.

   - Given that we now have a type-safe representation for the concept
of an encoding, where possible, please use const mozilla::Encoding*
mEncoding instead of nsCString mCharset to represent the concept of an
encoding in new code. mozilla::Encoding::ForName() and
mozilla::Encoding::Name() provide interop between the old and new
ways.

   - For each encoding, there's a type-safe constant for referring to
the encoding. To refer to UTF-8 from C++, use UTF_8_ENCODING. From
Rust, use encoding_rs::UTF_8.

 * The new API provides the full set of options for handling the BOM
correctly upon decode. Please pick the right one of the three options:
the default (BOM sniffing with the decoder potentially morphing into a
decoder for the BOM-indicated encoding), BOM removal (no decoder
morphing) and no BOM handling (the BOM is handled like any other input
bytes).

 * The new API handles the end of the stream correctly. Please
actually let the decoder know about the end of the stream when using
streaming decoding.

 * You no longer need to implement replacement of unmappable
characters yourself. The decoders generate REPLACEMENT CHARACTERs for
you by default and the encoders generate HTML number character
references for you by default. These are the only modes that exist in
the Web Platform, so other replacements are not supported (though it's
possible to implement other replacement on top of the API entry points
that do not perform replacement). The API lets you know if there were
any of these replacements, so you can e.g. whine to the console without
having to take over implementing the replacement yourself just because
you want to know whether any occurred.

 * For old-style type-unsafe use of encoding name in nsACString to
represent the concept of an encoding, the set of canonical names is
now exactly the set of names from the WHATWG Encoding Standard. This
means that:

   - ISO-8859-1 is no longer a Gecko-canonical name. Use windows-1252
instead. (I forgot to fix the remaining instances; the follow-up patch
is in https://bugzilla.mozilla.org/show_bug.cgi?id=1372994.)

   - gbk is no longer a Gecko-canonical name. The new canonical name is GBK.

   - UTF-16 is no longer a Gecko-canonical name. Use UTF-16LE instead.

 * Encoding to UTF-16 (LE or BE) is no longer supported. That is,
Gecko no longer has the capability of generating a _byte_ stream for
_interchange_ in an UTF-16 encoding. (Decoding into _in-RAM_ UTF-16 as
a stream of _16-bit units_ is, of course, supported.)

 * The encoders and decoders have no Reset() method. If you need a
converter to go back to its start state, just create a new one. It's
cheap. The creation