Re: [rust-dev] Proposed API for character encodings
On 21/09/2013 16:38, Olivier Renaud wrote:
> I'd expect this offset to be absolute. After all, the only thing that the
> programmer can do with this information at this point is to report it to
> the user; if the programmer wanted to handle the error, he could have
> done it by using a trap. A relative offset has no meaning outside of the
> processing loop, whereas an absolute offset can still be useful even
> outside of the program (if the source of the stream is a file, then an
> absolute offset will give the exact location of the error in the file).
>
> A counter is super cheap, I wouldn't worry about its cost. Actually, it
> just has to be incremented once for each call to 'feed'.

Well, to get the position inside a given chunk of input you still have to count individual bytes. (Maybe with Iterator::enumerate?) Unless maybe we do dirty pointer arithmetic…

If possible, I'd rather find a way to not have to pay that cost in the common case where the error handling is *not* abort and DecodeError is never used. This is also a bit annoying as each implementation will have to repeat the counting logic, but maybe it's still worth it.

> Note: for the encoder, you will have to specify whether the offset is a
> 'code point' count or a 'code unit' count.

Yes. I don't know yet. If we do [1] and make the input generic it will probably have to be code points.

[1] https://mail.mozilla.org/pipermail/rust-dev/2013-September/005662.html

Otherwise, it may be preferable to match Str::slice and count UTF-8 bytes. (Which I suppose is what you call code units?)

-- 
Simon Sapin

___
Rust-dev mailing list
Rust-dev@mozilla.org
https://mail.mozilla.org/listinfo/rust-dev
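The scheme under discussion, a per-stream counter bumped once per 'feed' call plus a within-chunk index computed only when an error is actually reported, can be sketched as follows. This is in today's Rust syntax, and the names (`StreamOffset`, `feed`, `error_at`) are illustrative, not part of the proposed API:

```rust
// Illustrative sketch: track the absolute byte offset across `feed`
// calls so a chunk-relative error position can be reported as an
// absolute position in the whole stream.
struct StreamOffset {
    consumed: usize, // bytes consumed by all previous `feed` calls
}

impl StreamOffset {
    fn new() -> StreamOffset {
        StreamOffset { consumed: 0 }
    }

    /// Convert a chunk-relative error position into an absolute one,
    /// then account for the whole chunk (one addition per call).
    fn feed(&mut self, chunk: &[u8], error_at: Option<usize>) -> Option<usize> {
        let absolute = error_at.map(|rel| self.consumed + rel);
        self.consumed += chunk.len();
        absolute
    }
}

fn main() {
    let mut off = StreamOffset::new();
    assert_eq!(off.feed(b"hello", None), None);
    // An error at index 2 of the second chunk is at absolute offset 7.
    assert_eq!(off.feed(b"world", Some(2)), Some(7));
}
```

Note that the counter costs one addition per chunk even when no error occurs, which is exactly the cost being debated above.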
Re: [rust-dev] Proposed API for character encodings
On 20/09/2013 20:07, Olivier Renaud wrote:
> I have one more question regarding the error handling: in DecodeError,
> what does 'input_byte_offset' mean? Is it relative to the
> 'invalid_byte_sequence' or to the beginning of the decoded stream?

Good point. I'm not sure. (Remember, I make this up as we go along :).)

If it counts from the start of the entire input, this would require decoders to keep count, which is unnecessary work in cases where you don't use it (eg. with the Replace error handling). So it could instead count from the beginning of the input in the last call to .feed() to the beginning of the invalid byte sequence, *which can be negative*, in case the invalid sequence started in an earlier .feed() call.

What do you think it should be?

-- 
Simon Sapin
Re: [rust-dev] Proposed API for character encodings
I really like the API you are proposing. In particular, the error handling is close to what I was expecting from such an API. I have some remarks, though.

Is there a reason for encoders and decoders to not be reusable? I think it would be reasonable to specify that they get back to their initial state once the 'flush' method is called, or when a 'DecodeError' is returned.

Is a condition raised when the order of method calls is not respected? E.g. if one calls 'flush' multiple times, or calls 'feed' and then 'decode'?

It is not clear what is given as a parameter to the 'decoding_error' condition. I guess it's the exact subset of the byte sequence that cannot be decoded, possibly spanning multiple 'feed' calls. Is that correct? Is it sufficient for variable-length encodings?

I am doubtful that the encoder is just a decoder with [u8] and str swapped. A decoder must deal with a possibly invalid sequence of bytes, while an encoder deals with str, which is guaranteed to be a valid UTF-8 sequence. An encoder must handle unmappable characters, whereas a decoder doesn't (actually, it depends whether we consider Unicode to be universal or not...). I think it would be a good idea to make a difference between an invalid sequence and an unmappable character. I think there should be both an 'invalid_sequence' and an 'unmappable_char' condition.

Also, the 'fatal' handler is a bit scary; based on the name I'd expect it to result in a 'fail!'.

I propose this set of conditions and handlers:

    // Decoder conditions
    condition! {
        /// The byte sequence is not a valid input
        pub invalid_sequence : ~[u8] -> Option<~str>;
        /// The byte sequence cannot be represented in Unicode (rarely used)
        pub unmappable_bytes : ~[u8] -> Option<~str>;
    }

    // Encoder condition
    condition! {
        /// The Unicode string cannot be represented in the target encoding
        /// (essential for single byte encodings)
        pub unmappable_str : ~str -> Option<~[u8]>;
    }

    /// Functions to be used with invalid_sequence::cond.trap
    /// or unmappable_bytes::cond.trap
    mod decoding_error_handlers {
        fn decoder_error(_: ~[u8]) -> Option<~str> { None }
        fn replacement(_: ~[u8]) -> Option<~str> { Some(~"\uFFFD") }
        fn ascii_substitute(_: ~[u8]) -> Option<~str> { Some(~"\u001A") }
        fn ignore(_: ~[u8]) -> Option<~str> { Some(~"") }
    }

    /// Functions to be used with unmappable_str::cond.trap
    mod encoding_error_handlers {
        fn encoder_error(_: ~str) -> Option<~[u8]> { None }
        fn ascii_substitute(_: ~str) -> Option<~[u8]> { Some(~[0x1A]) }
        fn ignore(_: ~str) -> Option<~[u8]> { Some(~[]) }
    }

Not sure about this substitute/replacement duality. Maybe we can have only one function named 'default', that would be FFFD for Unicode and 1A for ASCII.
Re: [rust-dev] Proposed API for character encodings
On 20/09/2013 10:18, Olivier Renaud wrote:
> I really like the API you are proposing. In particular, the error
> handling is close to what I was expecting from such an API. I have some
> remarks, though.
>
> Is there a reason for encoders and decoders to not be reusable? I think
> it would be reasonable to specify that they get back to their initial
> state once the 'flush' method is called, or when a 'DecodeError' is
> returned.

I don't have a strong opinion on that. There could be a reset or similar method, but I don't see how this is better than just throwing the decoder away and making a new one.

With static dispatch and the encoding known at compile-time, you can probably have decoders on the stack, so making a new one is cheap. If the encoding is determined at run-time and you use trait objects (dynamic dispatch) for decoders, the next input might have a different encoding, so reusing decoders might not be useful either.

> Is a condition raised when the order of method calls is not respected?
> E.g. if one calls 'flush' multiple times, or calls 'feed' and then
> 'decode'?

Decoder::decode is a static method / associated function. It's independent from everything else.

Other than that, I don't know. rust-encoding doesn't do that. AFAIU it leaves this behavior undefined, which I think is fine. Do you think it should be explicitly checked for?

> It is not clear what is given as a parameter to the 'decoding_error'
> condition. I guess it's the exact subset of the byte sequence that
> cannot be decoded, possibly spanning multiple 'feed' calls. Is that
> correct? Is it sufficient for variable-length encodings?

Correct, and I think yes. It is called once every time the spec says to run the error algorithm: http://encoding.spec.whatwg.org/#error

> I am doubtful that the encoder is just a decoder with [u8] and str
> swapped. A decoder must deal with a possibly invalid sequence of bytes,
> while an encoder deals with str, which is guaranteed to be a valid UTF-8
> sequence. An encoder must handle unmappable characters, whereas a
> decoder doesn't

You're right, I cut some corners. In particular, the encoding_error condition can take a single (unsupported) 'char'. Other than that, the *API* is (very close to?) the same with [u8] and str swapped.

> (actually, it depends whether we consider Unicode to be universal or
> not...).

I suggest we consider it is. (For the purpose of the WHATWG spec it is.) If Unicode is missing things, the right solution is to add things to Unicode.

> I think it would be a good idea to make a difference between an invalid
> sequence and an unmappable character. I think there should be both an
> 'invalid_sequence' and an 'unmappable_char' condition.

That's the distinction between decoding_error and encoding_error, which already exists.

> Also, the 'fatal' handler is a bit scary; based on the name I'd expect
> it to result in a 'fail!'.

I'm open to other names. Maybe abort? The idea is that you reject the entirety of this input (including previous successful calls to .feed()).

> I propose this set of conditions and handlers:
> [...]

I think that unmappable_bytes is not needed, and the other two should just be decoding_error and encoding_error. (See above.)

> Not sure about this substitute/replacement duality. Maybe we can have
> only one function named 'default', that would be FFFD for Unicode and 1A
> for ASCII.

I think we should only provide two handlers for each of decoding and encoding: fail/abort/error, and replace. The latter is U+FFFD (replacement character) for decoding and 0x3F (ASCII question mark) for encoding, as in the WHATWG spec, per web-compatibility constraints.

In particular, ignore is terrible and should not be encouraged. (Depending on what you're doing with it, it could lead to security issues.) If you do want ignore or ASCII substitute, writing a custom condition handler is easy.
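The two policies Simon settles on (abort and replace) can be demonstrated concretely with plain UTF-8 decoding; in today's Rust syntax, strict and lossy decoding correspond to exactly these two behaviors. This is an illustration of the policies, not the proposed condition-based API:

```rust
// "abort"/"fatal": reject the whole input on the first invalid sequence.
// "replace": keep going, substituting U+FFFD for each invalid sequence.
fn main() {
    let input: &[u8] = b"ab\xFFcd"; // 0xFF is never valid in UTF-8

    // Abort policy: decoding fails outright.
    assert!(std::str::from_utf8(input).is_err());

    // Replace policy: U+FFFD (REPLACEMENT CHARACTER) stands in for the
    // invalid byte, as the WHATWG spec requires for HTML.
    let lossy = String::from_utf8_lossy(input);
    assert_eq!(lossy, "ab\u{FFFD}cd");
}
```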
Re: [rust-dev] Proposed API for character encodings
On 20/09/2013 13:40, Henri Sivonen wrote:
> On Tue, Sep 10, 2013 at 6:47 PM, Simon Sapin <simon.sa...@exyr.org> wrote:
>> /// Call this to indicate the end of the input.
>> /// The Decoder instance should be discarded afterwards.
>> /// Some encodings may append some final output at this point.
>> /// May raise the decoding_error condition.
>> fn flush(output: &mut ~str) -> Option<DecodeError>;
>
> Please call this finish instead of calling it flush. In other APIs, for
> example JDK APIs, flush really just means flushing the current buffers
> instead of ending the stream, so calling the method that does
> end-of-stream processing flush would be confusing.

flush is the name that rust-encoding uses, but I agree that finish is better for what it does.

-- 
Simon Sapin
Re: [rust-dev] Proposed API for character encodings
On 10/09/2013 16:47, Simon Sapin wrote:
> TL;DR: the actual proposal is at the end of this email.

I moved this to the wiki, to better deal with updates:
https://github.com/mozilla/rust/wiki/Proposal-for-character-encoding-API

-- 
Simon Sapin
Re: [rust-dev] Proposed API for character encodings
On Friday, 20 September 2013 at 11:47:04, Simon Sapin wrote:
>> Is there a reason for encoders and decoders to not be reusable? I think
>> it would be reasonable to specify that they get back to their initial
>> state once the 'flush' method is called, or when a 'DecodeError' is
>> returned.
>
> I don't have a strong opinion on that. There could be a reset or similar
> method, but I don't see how this is better than just throwing the
> decoder away and making a new one.

I don't see the need for a 'reset' method. A decoder could return to its initial state after a call to 'finish'.

> With static dispatch and the encoding known at compile-time, you can
> probably have decoders on the stack so making a new one is cheap. If the
> encoding is determined at run-time and you use trait objects (dynamic
> dispatch) for decoders, the next input might have a different encoding
> so reusing decoders might not be useful either.

My typical usage of a charset decoder is to read many files on disk, all of them using the same charset.

>> Is a condition raised when the order of method calls is not respected?
>> E.g. if one calls 'flush' multiple times, or calls 'feed' and then
>> 'decode'?
>
> Decoder::decode is a static method / associated function. It's
> independent from everything else.

Oh yes, of course, my bad.

> Other than that, I don't know. rust-encoding doesn't do that. AFAIU it
> leaves this behavior undefined, which I think is fine. Do you think it
> should be explicitly checked for?

Well, in my opinion it is not a good idea for an API to have undefined behavior. Being explicit about what is disallowed also helps the user to understand how the API is supposed to be used. Also, I think it's preferable to fail fast, when the state of an object becomes invalid.

There are a handful of reasonable behaviors for the decoder:

* If reusing a decoder is legal, then calling 'feed' after 'finish' is legal (we start decoding a new stream), so there is no need to introduce a special case. A second call to 'finish' can be a no-op (we decode an empty stream).
* If reusing a decoder is illegal:
  -- Calling 'feed' after 'finish' should be an error. The API must report that it is being misused by the programmer. I don't know what the recommended way to do that in Rust is; I think it's ok to fail!, or to have an assert. In Java, I'd throw an (unchecked) IllegalStateException, which serves exactly this purpose.
  -- Calling 'finish' a second time can also be a no-op, but it would be better to be consistent with the 'feed'-after-'finish' behavior and to fail.

Another totally different solution would be to use phantom types to indicate the state of the decoder, but that would be overkill. Or typestates :)

Simpler is better, so I think having a reusable decoder with no special invalid state is the least problematic solution.

[...]

>> (actually, it depends whether we consider Unicode to be universal or
>> not...)
>
> I suggest we consider it is. (For the purpose of the WHATWG spec it is.)
> If Unicode is missing things, the right solution is to add things to
> Unicode.

It simplifies many things, indeed.

[...]
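For reference, the phantom-type alternative mentioned above can be sketched in today's Rust syntax. `Fresh`/`Finished` and the toy feed body (which assumes valid UTF-8 input) are illustrative assumptions, not part of the proposal; the point is only that 'feed' after 'finish' becomes a compile-time error rather than a run-time check:

```rust
// Typestate via phantom types: `feed` is only defined on Decoder<Fresh>,
// and `finish` consumes the decoder, handing back a Decoder<Finished>
// that has no `feed` method.
use std::marker::PhantomData;

struct Fresh;
struct Finished;

struct Decoder<State> {
    output: String,
    _state: PhantomData<State>,
}

impl Decoder<Fresh> {
    fn new() -> Decoder<Fresh> {
        Decoder { output: String::new(), _state: PhantomData }
    }

    // Stand-in for real decoding: assumes the chunk is valid UTF-8.
    fn feed(mut self, input: &[u8]) -> Decoder<Fresh> {
        self.output.push_str(std::str::from_utf8(input).unwrap());
        self
    }

    fn finish(self) -> (String, Decoder<Finished>) {
        (self.output, Decoder { output: String::new(), _state: PhantomData })
    }
}

fn main() {
    let decoder = Decoder::new().feed(b"hello");
    let (text, _done) = decoder.finish();
    assert_eq!(text, "hello");
    // `_done.feed(...)` would not compile: `feed` exists only on
    // Decoder<Fresh>.
}
```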
Re: [rust-dev] Proposed API for character encodings
On Friday, 20 September 2013 at 11:52:14, Simon Sapin wrote:
> On 13/09/2013 23:03, Simon Sapin wrote:
>> /// Takes the invalid byte sequence.
>> /// Return a replacement string, or None to abort with a DecodeError.
>> condition! {
>>     pub decoding_error : ~[u8] -> Option<~str>;
>> }
>>
>> /// Functions to be used with decoding_error::cond.trap
>> mod decoding_error_handlers {
>>     fn fatal(_: ~[u8]) -> Option<~str> { None }
>>     fn replacement(_: ~[u8]) -> Option<~str> { Some(~"\uFFFD") }
>> }
>
> Allocating ~"\uFFFD" repeatedly is, let's say, unfortunate. This could
> be avoided by having the return value be:
>
>     enum DecodingErrorResult {
>         AbortDecoding,
>         ReplacementString(~str),
>         ReplacementChar(char),
>     }
>
> Similarly, for encoding:
>
>     enum EncodingErrorResult {
>         AbortEncoding,
>         ReplacementByteSequence(~[u8]),
>         ReplacementByte(u8),
>     }

That's a nice addition, it's even better this way!

I have one more question regarding the error handling: in DecodeError, what does 'input_byte_offset' mean? Is it relative to the 'invalid_byte_sequence' or to the beginning of the decoded stream?
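Simon's enums transliterate almost directly into today's Rust syntax. The `append_replacement` helper below is an illustrative assumption (not part of the proposal) showing why the char variant avoids a heap allocation in the common U+FFFD case:

```rust
// Returning a `char` (or a single byte, on the encoding side) lets the
// decoder push the replacement directly, with no per-error string
// allocation; the owned-string variant remains available for handlers
// that need a longer replacement.
enum DecodingErrorResult {
    AbortDecoding,
    ReplacementString(String),
    ReplacementChar(char),
}

/// Apply a handler's result to the output; returns false on abort.
fn append_replacement(out: &mut String, result: DecodingErrorResult) -> bool {
    match result {
        DecodingErrorResult::AbortDecoding => false,
        DecodingErrorResult::ReplacementString(s) => { out.push_str(&s); true }
        DecodingErrorResult::ReplacementChar(c) => { out.push(c); true }
    }
}

fn main() {
    let mut out = String::from("ab");
    // Common case: one char, no allocation for the replacement itself.
    assert!(append_replacement(&mut out, DecodingErrorResult::ReplacementChar('\u{FFFD}')));
    assert_eq!(out, "ab\u{FFFD}");
    assert!(append_replacement(&mut out, DecodingErrorResult::ReplacementString(String::from("?"))));
    assert!(!append_replacement(&mut out, DecodingErrorResult::AbortDecoding));
}
```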
Re: [rust-dev] Proposed API for character encodings
On Tue, Sep 10, 2013 at 6:47 PM, Simon Sapin <simon.sa...@exyr.org> wrote:
> /// Call this to indicate the end of the input.
> /// The Decoder instance should be discarded afterwards.
> /// Some encodings may append some final output at this point.
> /// May raise the decoding_error condition.
> fn flush(output: &mut ~str) -> Option<DecodeError>;

Please call this finish instead of calling it flush. In other APIs, for example JDK APIs, flush really just means flushing the current buffers instead of ending the stream, so calling the method that does end-of-stream processing flush would be confusing.

-- 
Henri Sivonen
hsivo...@hsivonen.fi
http://hsivonen.iki.fi/
Re: [rust-dev] Proposed API for character encodings
On 19/09/2013 13:39, Jeffery Olson wrote:
>> As to the implementation: rust-encoding has a lot that could be adapted.
>> https://github.com/lifthrasiir/rust-encoding
>
> Can someone comment on whether we should look at adapting what's in
> str::from_utf8 (really, str::raw::from_buf_len is where the action is)
> and str::from_utf16 for this? Everyone in IRC I ask says that they are
> correct... they're also highly optimized... are they appropriate for
> this API? And if not, are we comfortable having two totally separate
> paths for string decoding?

I don't think anybody is advocating duplicating implementations of the same thing. My understanding is that UTF8Decoder and the existing API in std::str will end up calling the same code. That code could be libstd's existing implementation extended for error handling, or rust-encoding's, or something else. I don't have a strong opinion about it.

UTF-16 is a bit special, because libstd's existing APIs deal with native-endian [u16], while encoding APIs will need both UTF-16-LE and UTF-16-BE on [u8]. I don't know how much can be shared.

But once again, I'm more interested in getting the API and the behavior right. I trust the smart people working on Rust to refactor and optimize the implementation over time.

Cheers,
-- 
Simon Sapin
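The endianness point can be made concrete: libstd's UTF-16 decoding takes native-endian u16 units, so a byte-oriented UTF-16-LE (or -BE) decoder must first assemble units with an explicit byte order. A sketch in today's Rust syntax; `utf16_le_units` is a hypothetical helper, not a libstd API:

```rust
// Assemble native-endian u16 code units from a UTF-16-LE byte stream,
// then hand them to the unit-oriented decoder. A -BE variant would use
// u16::from_be_bytes instead.
fn utf16_le_units(bytes: &[u8]) -> Vec<u16> {
    bytes
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .collect()
}

fn main() {
    // "hi" encoded as UTF-16-LE.
    let bytes = [0x68, 0x00, 0x69, 0x00];
    let units = utf16_le_units(&bytes);
    assert_eq!(String::from_utf16(&units).unwrap(), "hi");
}
```

(A real decoder would also have to handle an odd trailing byte across chunk boundaries, which is exactly the kind of state the push-based API keeps between 'feed' calls.)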
Re: [rust-dev] Proposed API for character encodings
On 09/10/2013 08:47 AM, Simon Sapin wrote:
> Hi,
>
> TL;DR: the actual proposal is at the end of this email.

Thanks for working on this. It's crucial.

> Rust today has good support for UTF-8, which new content definitely
> should use, but many systems still have to deal with legacy content
> that uses other character encodings. There are several projects around
> to implement more encodings in Rust. The furthest along, in my opinion,
> is rust-encoding, notably because it implements the right specification.
>
> rust-encoding: https://github.com/lifthrasiir/rust-encoding
> The spec: http://encoding.spec.whatwg.org/
>
> It has more precise definitions of error handling than some original
> RFCs, and better reflects the reality of legacy content on the web.
>
> There was some discussion in the past few days about importing
> rust-encoding (or part of it) into Rust's libstd or libextra. Before
> that, I think it is important to define a good API. The spec defines
> one for JavaScript, but we should not copy that exactly.
> rust-encoding's API is mostly good, but I think that error handling
> could be simplified.
>
> In abstract terms, an encoding (such as UTF-8) is made of a decoder and
> an encoder. A decoder converts a stream of bytes into a stream of text
> (Unicode scalar values, ie. code points excluding surrogates), while an
> encoder does the reverse. This does not cover other kinds of stream
> transformation such as base64, compression, encryption, etc. Bytes are
> represented in Rust by u8, text by str/char.
>
> (Side note: Because of constraints imposed by JavaScript and to avoid
> costly conversions, Servo will probably use a different data type for
> representing text. This encoding API could eventually become generic
> over a Text trait, but I think that it should stick to str for now.)
>
> The most convenient way to represent a stream is with a vector or
> string. This however requires the whole input to be in memory before
> decoding/encoding can start, and that to be finished before any of the
> output can be used. It should definitely be possible to eg. decode some
> content as it arrives from the network, and parse it in a pipeline.
>
> The most fundamental type of API is one where the user repeatedly
> pushes chunks of input into a decoder/encoder object (that may maintain
> state between chunks) and gets the output so far in return, then
> signals the end of the input. An iterator adapter, where the user pulls
> output from the decoder which pulls from the input, can be nicer, but
> is easy to build on top of a push-based API, while the reverse requires
> tasks.
>
> Iterator<u8> and Iterator<char> are tempting, but we may need to work
> on big chunks at a time for efficiency: Iterator<~[u8]> and
> Iterator<~str>. Or could single-byte/char iterators be reliably inlined
> to achieve similar efficiency?

Can Iterator<&[u8]> work if the iterator itself contains a fixed-size or preallocated buffer? For I/O purposes, allocating a bunch of buffers just to write them out to a stream sounds wasteful...

> Finally, this API also needs to support several kinds of error
> handling. For example, a decoder should abort at the invalid byte
> sequence for XML, but insert U+FFFD (replacement character) for HTML.
> I'm not decided yet whether to just have the closed set of error
> handling modes defined in the spec, or make this open-ended with
> conditions.
>
> Based on all the above, here is a proposed API. Encoders are omitted,
> but they are mostly the same as decoders with [u8] and str swapped.
>
>     /// Types implementing this trait are algorithms
>     /// such as UTF8, UTF16, SingleByteEncoding, etc.
>     /// Values of these types are encodings as defined in the WHATWG
>     /// spec: UTF-8, UTF-16-LE, Windows-1252, etc.
>     trait Encoding {
>         /// Could become an associated type with a ::new() constructor
>         /// when the language supports that.
>         fn new_decoder(&self) -> ~Decoder;
>
>         /// Simple, one shot API.
>         /// Decode a single byte string that is entirely in memory.
>         /// May raise the decoding_error condition.
>         fn decode(&self, input: &[u8]) -> Result<~str, DecodeError> {
>             // Implementation (using a Decoder) left out.
>             // This is a default method, but not meant to be overridden.
>         }
>     }
>
>     /// Takes the invalid byte sequence.
>     /// Return a replacement string, or None to abort with a
>     /// DecodeError.
>     condition! {
>         pub decoding_error : ~[u8] -> Option<~str>;
>     }
>
>     struct DecodeError {
>         input_byte_offset: uint,
>         invalid_byte_sequence: ~[u8],
>     }
>
>     /// Each implementation of Encoding has one corresponding
>     /// implementation of Decoder (and one of Encoder).
>     ///
>     /// A new Decoder instance should be used for every input.
>     /// A Decoder instance should be discarded after DecodeError was
>     /// returned.
>     trait Decoder {
>         /// Call this repeatedly with a chunk of input bytes.
>         /// As much as possible of the decoded text is appended to
>         /// output. May raise the decoding_error condition.
>         fn feed(&mut self, input: &[u8], output: &mut ~str)
>             -> Option<DecodeError>;
>
>         /// Call this
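The push-based shape of the proposal (feed chunks in, append decoded text to an output buffer, then finish) can be sketched in today's Rust syntax. The Latin-1 decoder below is an illustrative stand-in, chosen because every byte maps directly to U+0000..U+00FF and so no cross-chunk state is needed; it is not part of the proposal, and real multi-byte decoders would carry partial sequences between feed calls:

```rust
// Minimal push-based decoder shape: `feed` may be called repeatedly,
// `finish` signals end of input (where a stateful decoder would flush
// or report a truncated sequence).
trait Decoder {
    fn feed(&mut self, input: &[u8], output: &mut String);
    fn finish(&mut self, output: &mut String);
}

struct Latin1Decoder;

impl Decoder for Latin1Decoder {
    fn feed(&mut self, input: &[u8], output: &mut String) {
        for &byte in input {
            // u8 -> char is exactly the Latin-1 (ISO-8859-1) mapping.
            output.push(byte as char);
        }
    }

    fn finish(&mut self, _output: &mut String) {
        // Latin-1 keeps no trailing state; a multi-byte encoding might
        // emit a replacement or report an error here.
    }
}

fn main() {
    let mut decoder = Latin1Decoder;
    let mut out = String::new();
    decoder.feed(b"caf\xE9", &mut out); // 0xE9 is 'é' in Latin-1
    decoder.feed(b" au lait", &mut out);
    decoder.finish(&mut out);
    assert_eq!(out, "café au lait");
}
```

The chunk boundary in the usage above falls in the middle of the text, which is exactly the streaming case the proposal wants to support.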
Re: [rust-dev] Proposed API for character encodings
On 09/10/2013 05:47 PM, Simon Sapin wrote:
> [...]
Re: [rust-dev] Proposed API for character encodings
On 11/09/2013 17:19, Marvin Löbel wrote:
> On 09/10/2013 05:47 PM, Simon Sapin wrote:
>> [...]
[rust-dev] Proposed API for character encodings
Hi,

TL;DR: the actual proposal is at the end of this email.

Rust today has good support for UTF-8, which new content definitely should use, but many systems still have to deal with legacy content that uses other character encodings. There are several projects around to implement more encodings in Rust. The furthest along, in my opinion, is rust-encoding, notably because it implements the right specification.

rust-encoding: https://github.com/lifthrasiir/rust-encoding
The spec: http://encoding.spec.whatwg.org/

It has more precise definitions of error handling than some original RFCs, and better reflects the reality of legacy content on the web.

There was some discussion in the past few days about importing rust-encoding (or part of it) into Rust’s libstd or libextra. Before that, I think it is important to define a good API. The spec defines one for JavaScript, but we should not copy that exactly. rust-encoding’s API is mostly good, but I think that error handling could be simplified.

In abstract terms, an encoding (such as UTF-8) is made of a decoder and an encoder. A decoder converts a stream of bytes into a stream of text (Unicode scalar values, ie. code points excluding surrogates), while an encoder does the reverse. This does not cover other kinds of stream transformation such as base64, compression, encryption, etc.

Bytes are represented in Rust by u8, text by str/char.

(Side note: because of constraints imposed by JavaScript and to avoid costly conversions, Servo will probably use a different data type for representing text. This encoding API could eventually become generic over a Text trait, but I think that it should stick to str for now.)

The most convenient way to represent a stream is with a vector or string. This however requires the whole input to be in memory before decoding/encoding can start, and that to be finished before any of the output can be used. It should definitely be possible to eg. decode some content as it arrives from the network, and parse it in a pipeline.

The most fundamental type of API is one where the user repeatedly pushes chunks of input into a decoder/encoder object (that may maintain state between chunks) and gets the output so far in return, then signals the end of the input. An iterator adapter, where the user pulls output from the decoder which in turn pulls from the input, can be nicer, but is easy to build on top of a push-based API, while the reverse requires tasks.

Iterator<u8> and Iterator<char> are tempting, but we may need to work on big chunks at a time for efficiency: Iterator<~[u8]> and Iterator<~str>. Or could single-byte/char iterators be reliably inlined to achieve similar efficiency?

Finally, this API also needs to support several kinds of error handling. For example, a decoder should abort at the invalid byte sequence for XML, but insert U+FFFD (the replacement character) for HTML. I’m not decided yet whether to just have the closed set of error handling modes defined in the spec, or make this open-ended with conditions.

Based on all the above, here is a proposed API. Encoders are omitted, but they are mostly the same as decoders with [u8] and str swapped.

/// Types implementing this trait are algorithms
/// such as UTF8, UTF-16, SingleByteEncoding, etc.
/// Values of these types are encodings as defined in the WHATWG spec:
/// UTF-8, UTF-16-LE, Windows-1252, etc.
trait Encoding {
    /// Could become an associated type with a ::new() constructor
    /// when the language supports that.
    fn new_decoder(&self) -> ~Decoder;

    /// Simple, one shot API.
    /// Decode a single byte string that is entirely in memory.
    /// May raise the decoding_error condition.
    fn decode(&self, input: &[u8]) -> Result<~str, DecodeError> {
        // Implementation (using a Decoder) left out.
        // This is a default method, but not meant to be overridden.
    }
}

/// Takes the invalid byte sequence.
/// Return a replacement string, or None to abort with a DecodeError.
condition! {
    pub decoding_error : ~[u8] -> Option<~str>;
}

struct DecodeError {
    input_byte_offset: uint,
    invalid_byte_sequence: ~[u8],
}

/// Each implementation of Encoding has one corresponding implementation
/// of Decoder (and one of Encoder).
///
/// A new Decoder instance should be used for every input.
/// A Decoder instance should be discarded after DecodeError was returned.
trait Decoder {
    /// Call this repeatedly with a chunk of input bytes.
    /// As much as possible of the decoded text is appended to output.
    /// May raise the decoding_error condition.
    fn feed(&mut self, input: &[u8], output: &mut ~str) -> Option<DecodeError>;

    /// Call this to indicate the end of the input.
    /// The Decoder instance should be discarded afterwards.
    /// Some encodings may append some final output at this point.
    /// May raise the decoding_error condition.
    fn flush(&mut self, output: &mut ~str) -> Option<DecodeError>;
}

/// Pull-based API.
struct
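For illustration, the feed/flush shape proposed above can be sketched in present-day Rust. This is a minimal sketch under assumptions the email does not make: String and Vec<u8> stand in for ~str and ~[u8], errors are returned as values instead of raised through the (since-removed) condition system, and AsciiDecoder is a hypothetical toy implementation, used only to show how a stateful decoder can report an absolute input_byte_offset across multiple feed() calls — the question debated elsewhere in this thread.

```rust
// Sketch of the push-based Decoder shape in modern Rust.
// AsciiDecoder is a hypothetical toy, not part of the proposal.

#[derive(Debug, PartialEq)]
struct DecodeError {
    input_byte_offset: usize, // absolute, from the start of the stream
    invalid_byte_sequence: Vec<u8>,
}

trait Decoder {
    /// Call repeatedly with chunks of input; decoded text is appended to `output`.
    fn feed(&mut self, input: &[u8], output: &mut String) -> Option<DecodeError>;
    /// Signal the end of the input; some encodings emit final output here.
    fn flush(&mut self, output: &mut String) -> Option<DecodeError>;
}

/// Toy decoder: accepts ASCII bytes, rejects anything >= 0x80.
/// The running byte count is what makes input_byte_offset absolute
/// even when the input arrives in several feed() calls.
struct AsciiDecoder {
    bytes_seen: usize,
}

impl Decoder for AsciiDecoder {
    fn feed(&mut self, input: &[u8], output: &mut String) -> Option<DecodeError> {
        for (i, &b) in input.iter().enumerate() {
            if b >= 0x80 {
                return Some(DecodeError {
                    input_byte_offset: self.bytes_seen + i,
                    invalid_byte_sequence: vec![b],
                });
            }
            output.push(b as char);
        }
        self.bytes_seen += input.len();
        None
    }

    fn flush(&mut self, _output: &mut String) -> Option<DecodeError> {
        None // ASCII is stateless: nothing is buffered at end of input
    }
}

fn main() {
    let mut decoder = AsciiDecoder { bytes_seen: 0 };
    let mut out = String::new();
    // Input arrives in chunks, e.g. from the network.
    assert!(decoder.feed(b"hello ", &mut out).is_none());
    let err = decoder.feed(b"w\xFFrld", &mut out).unwrap();
    // Absolute offset: 6 bytes from the first chunk + 1 from this one.
    assert_eq!(err.input_byte_offset, 7);
    assert_eq!(err.invalid_byte_sequence, vec![0xFF]);
    assert_eq!(out, "hello w"); // everything before the error was decoded
    println!("{:?}", err);
}
```

Keeping the running count inside the decoder is exactly the cost discussed in the replies: it buys an absolute, user-reportable offset, but is paid even when the caller replaces errors and never inspects a DecodeError.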