Re: [rust-dev] Proposed API for character encodings

2013-09-22 Thread Simon Sapin

On 21/09/2013 at 16:38, Olivier Renaud wrote:

I'd expect this offset to be absolute. After all, the only thing that the
programmer can do with this information at this point is to report it to the
user; if the programmer wanted to handle the error, he could have done it by
using a trap. A relative offset has no meaning outside of the processing loop,
whereas an absolute offset can still be useful even outside of the program (if
the source of the stream is a file, then an absolute offset gives the exact
location of the error in the file).

A counter is super cheap, I wouldn't worry about its cost. Actually, it just
has to be incremented once for each call to 'feed'.


Well, to get the position inside a given chunk of input you still have to 
count individual bytes. (Maybe with Iterator::enumerate?) Unless maybe 
we do dirty pointer arithmetic…


If possible, I’d rather find a way to not have to pay that cost in the 
common case where the error handling is *not* abort and DecodeError is 
never used.


This is also a bit annoying as each implementation will have to repeat 
the counting logic, but maybe it’s still worth it.
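If the counting moves out of the decoders, a thin wrapper could do it once for everyone. A minimal sketch in today's Rust, where `OffsetTracker` and its `absolute` method are hypothetical illustrations (not part of the proposed API): the caller records how many bytes it has fed so far, and converts a chunk-relative error offset, which may be negative, into an absolute one.

```rust
// Sketch only: stands in for bookkeeping the caller (not the decoder)
// would do around the proposed `feed` API.
struct OffsetTracker {
    bytes_fed: usize, // total bytes passed to `feed` so far
}

impl OffsetTracker {
    fn new() -> Self {
        OffsetTracker { bytes_fed: 0 }
    }

    /// Record a chunk and translate a chunk-relative error offset
    /// (possibly negative, if the invalid sequence started in an
    /// earlier chunk) into an absolute byte offset in the stream.
    fn absolute(&mut self, chunk_len: usize, relative: isize) -> usize {
        let abs = (self.bytes_fed as isize + relative) as usize;
        self.bytes_fed += chunk_len;
        abs
    }
}

fn main() {
    let mut t = OffsetTracker::new();
    // First 4-byte chunk: error reported 2 bytes into it.
    assert_eq!(t.absolute(4, 2), 2);
    // Second chunk: error started 1 byte before this chunk began.
    assert_eq!(t.absolute(4, -1), 3);
}
```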




Note: for the encoder, you will have to specify whether the offset is a 'code
point' count or a 'code unit' count.


Yes. I don’t know yet. If we do [1] and make the input generic it will 
probably have to be code points.


[1] https://mail.mozilla.org/pipermail/rust-dev/2013-September/005662.html

Otherwise, it may be preferable to match Str::slice and count UTF-8 
bytes. (Which I suppose is what you call code units?)
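For what it's worth, the two counts diverge as soon as the input leaves ASCII. In today's Rust:

```rust
fn main() {
    let s = "héllo"; // 'é' takes two bytes in UTF-8
    assert_eq!(s.chars().count(), 5); // code points
    assert_eq!(s.len(), 6);           // UTF-8 bytes ("code units")
}
```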


--
Simon Sapin
___
Rust-dev mailing list
Rust-dev@mozilla.org
https://mail.mozilla.org/listinfo/rust-dev


Re: [rust-dev] Proposed API for character encodings

2013-09-21 Thread Simon Sapin

On 20/09/2013 at 20:07, Olivier Renaud wrote:

I have one more question regarding the error handling : in DecodeError, what
does 'input_byte_offset' mean ? Is it relative to the 'invalid_byte_sequence'
or to the beginning of the decoded stream ?


Good point. I’m not sure. (Remember I make this up as we go along :).)
If it’s from the entirety of the input this would require decoders to 
keep count, which is unnecessary work in cases where you don’t use it. 
(eg. with the Replace error handling.)


So it could be from the beginning of the input in the last call to 
.feed() to the beginning of the invalid byte sequence, *which can be 
negative*, in case the invalid sequence started in an earlier .feed() call.


What do you think it should be?

--
Simon Sapin


Re: [rust-dev] Proposed API for character encodings

2013-09-20 Thread Olivier Renaud
I really like the API you are proposing. In particular, the error handling is 
close to what I was expecting from such an API.

I have some remarks, though.

Is there a reason for encoders and decoders to not be reusable? I think it 
would be reasonable to specify that they get back to their initial state once 
the 'flush' method is called, or when a 'DecodeError' is returned.

Is a condition raised when the order of method calls is not respected? E.g. 
if one calls 'flush' multiple times, or calls 'feed' and then 'decode'?

It is not clear what is given as a parameter to the 'decoding_error' 
condition. I guess it's the exact subset of the byte sequence that cannot be 
decoded, possibly spanning multiple 'feed' calls. Is that correct? Is it 
sufficient for variable-length encodings?

I am doubtful that the encoder is just a decoder with [u8] and str swapped. A 
decoder must deal with a possibly invalid sequence of bytes, while an encoder 
deals with str, which is guaranteed to be a valid UTF-8 sequence. An encoder 
must handle unmappable characters, whereas a decoder doesn't (actually, it 
depends whether we consider Unicode to be universal or not...).

I think it would be a good idea to distinguish between an invalid 
sequence and an unmappable character. I think there should be both an 
'invalid_sequence' and an 'unmappable_char' condition.

Also, the 'fatal' handler is a bit scary; based on the name, I'd expect it to 
result in a 'fail!'.

I propose this set of conditions and handlers:

// Decoder conditions
condition! {
    /// The byte sequence is not a valid input
    pub invalid_sequence : ~[u8] -> Option<~str>;
    /// The byte sequence cannot be represented in Unicode (rarely used)
    pub unmappable_bytes : ~[u8] -> Option<~str>;
}

// Encoder condition
condition! {
    /// The Unicode string cannot be represented in the target encoding
    /// (essential for single byte encodings)
    pub unmappable_str : ~str -> Option<~[u8]>;
}

/// Functions to be used with invalid_sequence::cond.trap
/// or unmappable_bytes::cond.trap
mod decoding_error_handlers {
    fn decoder_error(_: ~[u8]) -> Option<~str> { None }
    fn replacement(_: ~[u8]) -> Option<~str> { Some(~"\uFFFD") }
    fn ascii_substitute(_: ~[u8]) -> Option<~str> { Some(~"\u001A") }
    fn ignore(_: ~[u8]) -> Option<~str> { Some(~"") }
}

/// Functions to be used with unmappable_str::cond.trap
mod encoding_error_handlers {
    fn encoder_error(_: ~str) -> Option<~[u8]> { None }
    fn ascii_substitute(_: ~str) -> Option<~[u8]> { Some(~[0x1A]) }
    fn ignore(_: ~str) -> Option<~[u8]> { Some(~[]) }
}

Not sure about this substitute/replacement duality. Maybe we can have only one 
function named 'default', that would be U+FFFD for Unicode and 0x1A for ASCII.


Re: [rust-dev] Proposed API for character encodings

2013-09-20 Thread Simon Sapin

On 20/09/2013 at 10:18, Olivier Renaud wrote:

I really like the API you are proposing. In particular, the error handling is
close to what I was expecting from such an API.

I have some remarks, though.

Is there a reason for encoders and decoders to not be reusable? I think it
would be reasonable to specify that they get back to their initial state once
the 'flush' method is called, or when a 'DecodeError' is returned.


I don’t have a strong opinion on that. There could be a reset or 
similar method, but I don’t see how this is better than just throwing 
the decoder away and making a new one.


With static dispatch and the encoding known at compile-time, you can 
probably have decoders on the stack so making a new one is cheap.


If the encoding is determined at run-time and you use trait objects 
(dynamic dispatch) for decoders, the next input might have a different 
encoding so reusing decoders might not be useful either.
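The contrast between the two dispatch styles can be sketched with a hypothetical, stripped-down `Decoder` trait (modern Rust syntax; the names `Utf8Decoder`, `decode_static`, and `decode_dyn` are illustrations, not the proposal's API):

```rust
// Hypothetical minimal trait standing in for the proposed Decoder.
trait Decoder {
    fn name(&self) -> &'static str;
}

struct Utf8Decoder;
impl Decoder for Utf8Decoder {
    fn name(&self) -> &'static str { "utf-8" }
}

// Static dispatch: the encoding is known at compile time, the decoder
// can live on the stack, so making a fresh one per input is cheap.
fn decode_static<D: Decoder>(d: &D) -> &'static str {
    d.name()
}

// Dynamic dispatch: the encoding is chosen at run time via a trait
// object, and the next input might need a different decoder anyway.
fn decode_dyn(d: &dyn Decoder) -> &'static str {
    d.name()
}

fn main() {
    let d = Utf8Decoder;
    assert_eq!(decode_static(&d), "utf-8");
    assert_eq!(decode_dyn(&d), "utf-8");
}
```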




Is a condition raised when the order of method calls is not respected? E.g.
if one calls 'flush' multiple times, or calls 'feed' and then 'decode'?


Decoder::decode is a static method / associated function. It’s 
independent from everything else.


Other than that, I don’t know. rust-encoding doesn’t do that. AFAIU it 
leaves this behavior undefined, which I think is fine. Do you think it 
should be explicitly checked for?




It is not clear what is given as a parameter to the 'decoding_error'
condition. I guess it's the exact subset of the byte sequence that cannot be
decoded, possibly spanning multiple 'feed' calls. Is that correct? Is it
sufficient for variable-length encodings?


Correct, and I think yes. It is called once every time the spec says to 
run the error algorithm:


http://encoding.spec.whatwg.org/#error



I am doubtful that the encoder is just a decoder with [u8] and str swapped. A
decoder must deal with a possibly invalid sequence of bytes, while an encoder
deals with str, which is guaranteed to be a valid UTF-8 sequence. An encoder
must handle unmappable characters, whereas a decoder doesn't


You’re right, I cut some corners. In particular, the encoding_error 
condition can take a single (unsupported) 'char'. Other than that, the 
*API* is (very close to?) the same with [u8] and str swapped.




(actually, it
depends whether we consider Unicode to be universal or not...).


I suggest we consider it is. (For the purpose of the WHATWG spec it is.) 
If Unicode is missing things, the right solution is to add things to 
Unicode.




I think it would be a good idea to distinguish between an invalid
sequence and an unmappable character. I think there should be both an
'invalid_sequence' and an 'unmappable_char' condition.


That’s the distinction between decoding_error and encoding_error, which 
already exists.




Also, the 'fatal' handler is a bit scary; based on the name, I'd expect it to
result in a 'fail!'.


I’m open to other names. Maybe abort? The idea is that you reject the 
entirety of this input (including previous successful calls to .feed()).




I propose this set of conditions and handlers:

// Decoder conditions
condition! {
    /// The byte sequence is not a valid input
    pub invalid_sequence : ~[u8] -> Option<~str>;
    /// The byte sequence cannot be represented in Unicode (rarely used)
    pub unmappable_bytes : ~[u8] -> Option<~str>;
}

// Encoder condition
condition! {
    /// The Unicode string cannot be represented in the target encoding
    /// (essential for single byte encodings)
    pub unmappable_str : ~str -> Option<~[u8]>;
}


I think that unmappable_bytes is not needed, and the other two should 
just be decoding_error and encoding_error. (See above.)




/// Functions to be used with invalid_sequence::cond.trap
/// or unmappable_bytes::cond.trap
mod decoding_error_handlers {
    fn decoder_error(_: ~[u8]) -> Option<~str> { None }
    fn replacement(_: ~[u8]) -> Option<~str> { Some(~"\uFFFD") }
    fn ascii_substitute(_: ~[u8]) -> Option<~str> { Some(~"\u001A") }
    fn ignore(_: ~[u8]) -> Option<~str> { Some(~"") }
}

/// Functions to be used with unmappable_str::cond.trap
mod encoding_error_handlers {
    fn encoder_error(_: ~str) -> Option<~[u8]> { None }
    fn ascii_substitute(_: ~str) -> Option<~[u8]> { Some(~[0x1A]) }
    fn ignore(_: ~str) -> Option<~[u8]> { Some(~[]) }
}

Not sure about this substitute/replacement duality. Maybe we can have only one
function named 'default', that would be U+FFFD for Unicode and 0x1A for ASCII.


I think we should only provide two handlers each for decoding and 
encoding: fail/abort/error, and replace. The latter is U+FFFD 
(replacement character) for decoding and 0x3F (ASCII question mark) for 
encoding, as in the WHATWG spec, per web-compatibility constraints.


In particular, ignore is terrible and should not be encouraged. 
(Depending on what you’re doing with it, it could lead to security 
issues.) If you do want ignore or ASCII substitute, writing a custom 
condition handler is easy.
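For illustration, modern Rust's `String::from_utf8_lossy` implements exactly this replace behavior for UTF-8 decoding: invalid byte sequences become U+FFFD rather than aborting or being silently dropped.

```rust
fn main() {
    // 0xFF can never occur in well-formed UTF-8.
    let bytes = b"ab\xFFcd";
    let decoded = String::from_utf8_lossy(bytes);
    // Replace error handling: the invalid byte becomes U+FFFD.
    assert_eq!(decoded, "ab\u{FFFD}cd");
}
```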

Re: [rust-dev] Proposed API for character encodings

2013-09-20 Thread Simon Sapin

On 20/09/2013 at 13:40, Henri Sivonen wrote:

On Tue, Sep 10, 2013 at 6:47 PM, Simon Sapin simon.sa...@exyr.org wrote:

    /// Call this to indicate the end of the input.
    /// The Decoder instance should be discarded afterwards.
    /// Some encodings may append some final output at this point.
    /// May raise the decoding_error condition.
    fn flush(output: &mut ~str) -> Option<DecodeError>;

Please call this "finish" instead of calling it "flush". In other
APIs, for example JDK APIs, "flush" really just means flushing the
current buffers instead of ending the stream, so calling the method
that does end-of-stream processing "flush" would be confusing.


"flush" is the name that rust-encoding uses, but I agree that "finish" 
is better for what it does.


--
Simon Sapin


Re: [rust-dev] Proposed API for character encodings

2013-09-20 Thread Simon Sapin

On 10/09/2013 at 16:47, Simon Sapin wrote:

TL;DR: the actual proposal is at the end of this email.


I moved this to the wiki, to better deal with updates:
https://github.com/mozilla/rust/wiki/Proposal-for-character-encoding-API

--
Simon Sapin


Re: [rust-dev] Proposed API for character encodings

2013-09-20 Thread Olivier Renaud
On Friday 20 September 2013 at 11:47:04, Simon Sapin wrote:
 On 20/09/2013 at 10:18, Olivier Renaud wrote:
  I really like the API you are proposing. In particular, the error handling
  is close to what I was expecting from such an API.
  
  I have some remarks, though.
  
  Is there a reason for encoders and decoders to not be reusable? I think it
  would be reasonable to specify that they get back to their initial state
  once the 'flush' method is called, or when a 'DecodeError' is returned.
 I don’t have a strong opinion on that. There could be a reset or
 similar method, but I don’t see how this is better than just throwing
 the decoder away and making a new one.

I don't see the need for a 'reset' method. A decoder could return to its 
initial state after a call to 'finish'.

 With static dispatch and the encoding known at compile-time, you can
 probably have decoders on the stack so making a new one is cheap.
 
 If the encoding is determined at run-time and you use trait objects
 (dynamic dispatch) for decoders, the next input might have a different
 encoding so reusing decoders might not be useful either.

My typical usage of a charset decoder is to read many files on disk, all of 
them using the same charset.

  Is a condition raised when the order of method calls is not respected?
  E.g. if one calls 'flush' multiple times, or calls 'feed' and then
  'decode'?
 Decoder::decode is a static method / associated function. It’s
 independent from everything else.

Oh yes of course, my bad.

 Other than that, I don’t know. rust-encoding doesn’t do that. AFAIU it
 leaves this behavior undefined, which I think is fine. Do you think it
 should be explicitly checked for?

Well, in my opinion it is not a good idea for an API to have undefined 
behavior. Being explicit about what is disallowed also helps the user 
understand how the API is supposed to be used. Also, I think it's preferable 
to fail fast, when the state of an object becomes invalid.

There are a handful of reasonable behaviors for the decoder:

* If reusing a decoder is legal, then calling 'feed' after 'finish' is legal 
(we start decoding a new stream), and there is no need to introduce a special 
case. A second call to 'finish' can be a no-op (we decode an empty stream).

* If reusing a decoder is illegal:

-- Calling 'feed' after 'finish' should be an error. The API must report that 
it is being misused by the programmer. I don't know what the recommended way 
to do that is in Rust; I think it's ok to fail!, or to have an assert. In 
Java, I'd throw an (unchecked) IllegalStateException, which serves exactly 
this purpose.

-- Calling 'finish' a second time can also be a no-op, but it would be better 
to be consistent with the 'feed' after 'finish' behavior and to fail.

Another totally different solution would be to use phantom types, to indicate 
the state of the decoder, but that would be overkill. Or typestates :)

Simpler is better, so I think having a reusable decoder with no special 
invalid state is the least problematic solution.
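A toy sketch of that preferred option in modern Rust (the `ToyDecoder` below is purely illustrative, not the proposed API): `finish` returns the decoder to its initial state, so feeding again simply starts a new stream, and a second `finish` decodes an empty stream.

```rust
// Hypothetical toy decoder illustrating the "reusable, no invalid
// state" option. The decoding logic is deliberately trivial: ASCII
// bytes pass through, anything else is buffered as "incomplete".
struct ToyDecoder {
    pending: Vec<u8>, // bytes carried over between `feed` calls
}

impl ToyDecoder {
    fn new() -> Self {
        ToyDecoder { pending: Vec::new() }
    }

    fn feed(&mut self, input: &[u8], output: &mut String) {
        for &b in input {
            if b.is_ascii() {
                output.push(b as char);
            } else {
                self.pending.push(b);
            }
        }
    }

    fn finish(&mut self, output: &mut String) {
        // Emit U+FFFD for anything left undecoded, then reset,
        // so the decoder is back in its initial state.
        if !self.pending.is_empty() {
            output.push('\u{FFFD}');
        }
        self.pending.clear();
    }
}

fn main() {
    let mut d = ToyDecoder::new();
    let mut out = String::new();
    d.feed(b"hi", &mut out);
    d.finish(&mut out);
    assert_eq!(out, "hi");
    // The same decoder can start a new stream...
    let mut out2 = String::new();
    d.feed(b"ok", &mut out2);
    d.finish(&mut out2);
    // ...and a second `finish` is a harmless no-op.
    d.finish(&mut out2);
    assert_eq!(out2, "ok");
}
```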

  It is not clear what is given as a parameter to the 'decoding_error'
  condition. I guess it's the exact subset of the byte sequence that cannot be
  decoded, possibly spanning multiple 'feed' calls. Is that correct? Is it
  sufficient for variable-length encodings?
 
 Correct, and I think yes. It is called once every time the spec says to
 run the error algorithm:
 
 http://encoding.spec.whatwg.org/#error
 
  I am doubtful that the encoder is just a decoder with [u8] and str
  swapped. A decoder must deal with a possibly invalid sequence of bytes,
  while an encoder deals with str, which is guaranteed to be a valid utf8
  sequence. An encoder must handle unmappable characters, whereas a decoder
  doesn't
 
 You’re right, I cut some corners. In particular, the encoding_error
 condition can take a single (unsupported) 'char'. Other than that, the
 *API* is (very close to?) the same with [u8] and str swapped.
 
  (actually, it
  depends whether we consider Unicode to be universal or not...).
 
 I suggest we consider it is. (For the purpose of the WHATWG spec it is.)
 If Unicode is missing things, the right solution is to add things to
 Unicode.

It simplifies many things, indeed.

  [...]


Re: [rust-dev] Proposed API for character encodings

2013-09-20 Thread Olivier Renaud
On Friday 20 September 2013 at 11:52:14, Simon Sapin wrote:
 On 13/09/2013 at 23:03, Simon Sapin wrote:
  /// Takes the invalid byte sequence.
  /// Return a replacement string, or None to abort with a DecodeError.
  condition! {
      pub decoding_error : ~[u8] -> Option<~str>;
  }

  /// Functions to be used with decoding_error::cond.trap
  mod decoding_error_handlers {
      fn fatal(_: ~[u8]) -> Option<~str> { None }
      fn replacement(_: ~[u8]) -> Option<~str> { Some(~"\uFFFD") }
  }
 
 Allocating ~"\uFFFD" repeatedly is, let’s say, unfortunate. This could
 be avoided by having the return value be:

 enum DecodingErrorResult {
     AbortDecoding,
     ReplacementString(~str),
     ReplacementChar(char),
 }

 Similarly, for encoding:

 enum EncodingErrorResult {
     AbortEncoding,
     ReplacementByteSequence(~[u8]),
     ReplacementByte(u8),
 }

That's a nice addition, it's even better this way!
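A sketch of the quoted idea in modern Rust terms (`~str` becomes `String`; the `apply` helper is hypothetical, returning false for abort): the common single-character replacement is passed by value, so the handler itself performs no heap allocation for the replacement.

```rust
// Sketch of the enum-based error result with today's types.
enum DecodingErrorResult {
    AbortDecoding,
    ReplacementString(String), // allocates; for multi-char replacements
    ReplacementChar(char),     // the replacement itself needs no allocation
}

// Hypothetical helper: applies a handler's result to the output,
// returning false when decoding should abort.
fn apply(result: DecodingErrorResult, output: &mut String) -> bool {
    match result {
        DecodingErrorResult::AbortDecoding => false,
        DecodingErrorResult::ReplacementString(s) => {
            output.push_str(&s);
            true
        }
        DecodingErrorResult::ReplacementChar(c) => {
            output.push(c);
            true
        }
    }
}

fn main() {
    let mut out = String::new();
    assert!(apply(DecodingErrorResult::ReplacementChar('\u{FFFD}'), &mut out));
    assert_eq!(out, "\u{FFFD}");
    assert!(!apply(DecodingErrorResult::AbortDecoding, &mut out));
}
```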

I have one more question regarding the error handling: in DecodeError, what 
does 'input_byte_offset' mean? Is it relative to the 'invalid_byte_sequence' 
or to the beginning of the decoded stream?


Re: [rust-dev] Proposed API for character encodings

2013-09-20 Thread Henri Sivonen
On Tue, Sep 10, 2013 at 6:47 PM, Simon Sapin simon.sa...@exyr.org wrote:
    /// Call this to indicate the end of the input.
    /// The Decoder instance should be discarded afterwards.
    /// Some encodings may append some final output at this point.
    /// May raise the decoding_error condition.
    fn flush(output: &mut ~str) -> Option<DecodeError>;

Please call this "finish" instead of calling it "flush". In other
APIs, for example JDK APIs, "flush" really just means flushing the
current buffers instead of ending the stream, so calling the method
that does end-of-stream processing "flush" would be confusing.

-- 
Henri Sivonen
hsivo...@hsivonen.fi
http://hsivonen.iki.fi/


Re: [rust-dev] Proposed API for character encodings

2013-09-19 Thread Simon Sapin

On 19/09/2013 at 13:39, Jeffery Olson wrote:

As to the implementation: rust-encoding has a lot that could be adapted.
https://github.com/lifthrasiir/rust-encoding


Can someone comment on whether we should look at adapting what's in
str::from_utf8 (really, str::raw::from_buf_len is where the action is)
and str::from_utf16 for this? Everyone I ask on IRC says that they are
correct; they're also highly optimized. Are they appropriate for
this API? And if not, are we comfortable having two totally separate paths
for string decoding?


I don’t think anybody is advocating duplicating implementations of the 
same thing. My understanding is that UTF8Decoder and the existing API in 
std::str will end up calling the same code.


That code could be libstd’s existing implementation extended for error 
handling, or rust-encoding’s, or something else. I don’t have a strong 
opinion about it.


UTF-16 is a bit special, because libstd’s existing APIs deal with 
native-endian [u16], while encoding APIs will need both UTF-16-LE and 
UTF-16-BE on [u8]. I don’t know how much can be shared.



But once again, I’m more interested in getting the API and the behavior 
right. I trust the smart people working on Rust to refactor and optimize 
the implementation over time.


Cheers,
--
Simon Sapin


Re: [rust-dev] Proposed API for character encodings

2013-09-18 Thread Brian Anderson

On 09/10/2013 08:47 AM, Simon Sapin wrote:

Hi,

TL;DR: the actual proposal is at the end of this email.


Thanks for working on this. It's crucial.



Rust today has good support for UTF-8, which new content definitely 
should use, but many systems still have to deal with legacy content 
that uses other character encodings. There are several projects around 
to implement more encodings in Rust. The furthest along, in my 
opinion, is rust-encoding, notably because it implements the right 
specification.


rust-encoding: https://github.com/lifthrasiir/rust-encoding

The spec: http://encoding.spec.whatwg.org/
It has more precise definitions of error handling than some original 
RFCs, and better reflects the reality of legacy content on the web.


There was some discussion in the past few days about importing 
rust-encoding (or part of it) into Rust’s libstd or libextra. Before 
that, I think it is important to define a good API. The spec defines 
one for JavaScript, but we should not copy that exactly. 
rust-encoding’s API is mostly good, but I think that error handling 
could be simplified.



In abstract terms, an encoding (such as UTF-8) is made of a decoder 
and an encoder. A decoder converts a stream of bytes into a stream of 
text (Unicode scalar values, ie. code points excluding surrogates), 
while an encoder does the reverse. This does not cover other kinds of 
stream transformation such as base64, compression, encryption, etc.


Bytes are represented in Rust by u8, text by str/char.

(Side note: Because of constraints imposed by JavaScript and to avoid 
costly conversions, Servo will probably use a different data type for 
representing text. This encoding API could eventually become generic 
over a Text trait, but I think that it should stick to str for now.)



The most convenient way to represent a stream is with a vector or 
string. This however requires the whole input to be in memory before 
decoding/encoding can start, and that to be finished before any of the 
output can be used. It should definitely be possible to eg. decode 
some content as it arrives from the network, and parse it in a pipeline.


The most fundamental type of API is one where the user repeatedly 
pushes chunks of input into a decoder/encoder object (that may 
maintain state between chunks) and gets the output so far in return, 
then signals the end of the input.


An iterator adapter, where the user pulls output from the decoder 
which pulls from the input, can be nicer, but is easy to build on top 
of a push-based API, while the reverse requires tasks.


Iterator<u8> and Iterator<char> are tempting, but we may need to work 
on big chunks at a time for efficiency: Iterator<~[u8]> and 
Iterator<~str>. Or could single-byte/char iterators be reliably 
inlined to achieve similar efficiency?


Can Iterator<&[u8]> work if the iterator itself contains a fixed-sized 
or preallocated buffer? For I/O purposes, allocating a bunch of buffers 
just to write them out to a stream sounds wasteful.





Finally, this API also needs to support several kinds of error 
handling. For example, a decoder should abort at the invalid byte 
sequence for XML, but insert U+FFFD (replacement character) for HTML. 
I’m not decided yet whether to just have the closed set of error 
handling modes defined in the spec, or make this open-ended with 
conditions.



Based on all the above, here is a proposed API. Encoders are omitted, 
but they are mostly the same as decoders with [u8] and str swapped.



/// Types implementing this trait are algorithms
/// such as UTF8, UTF-16, SingleByteEncoding, etc.
/// Values of these types are encodings as defined in the WHATWG spec:
/// "UTF-8", "UTF-16-LE", "Windows-1252", etc.
trait Encoding {
    /// Could become an associated type with a ::new() constructor
    /// when the language supports that.
    fn new_decoder(&self) -> ~Decoder;

    /// Simple, one shot API.
    /// Decode a single byte string that is entirely in memory.
    /// May raise the decoding_error condition.
    fn decode(&self, input: &[u8]) -> Result<~str, DecodeError> {
        // Implementation (using a Decoder) left out.
        // This is a default method, but not meant to be overridden.
    }
}

/// Takes the invalid byte sequence.
/// Return a replacement string, or None to abort with a DecodeError.
condition! {
    pub decoding_error : ~[u8] -> Option<~str>;
}

struct DecodeError {
    input_byte_offset: uint,
    invalid_byte_sequence: ~[u8],
}

/// Each implementation of Encoding has one corresponding implementation
/// of Decoder (and one of Encoder).
///
/// A new Decoder instance should be used for every input.
/// A Decoder instance should be discarded after DecodeError was returned.
trait Decoder {
    /// Call this repeatedly with a chunk of input bytes.
    /// As much as possible of the decoded text is appended to output.
    /// May raise the decoding_error condition.
    fn feed(input: &[u8], output: &mut ~str) -> Option<DecodeError>;

/// Call this 

Re: [rust-dev] Proposed API for character encodings

2013-09-11 Thread Marvin Löbel

On 09/10/2013 05:47 PM, Simon Sapin wrote:

Hi,

TL;DR: the actual proposal is at the end of this email.

Rust today has good support for UTF-8, which new content definitely 
should use, but many systems still have to deal with legacy content 
that uses other character encodings. There are several projects around 
to implement more encodings in Rust. The furthest along, in my 
opinion, is rust-encoding, notably because it implements the right 
specification.


rust-encoding: https://github.com/lifthrasiir/rust-encoding

The spec: http://encoding.spec.whatwg.org/
It has more precise definitions of error handling than some original 
RFCs, and better reflects the reality of legacy content on the web.


There was some discussion in the past few days about importing 
rust-encoding (or part of it) into Rust’s libstd or libextra. Before 
that, I think it is important to define a good API. The spec defines 
one for JavaScript, but we should not copy that exactly. 
rust-encoding’s API is mostly good, but I think that error handling 
could be simplified.



In abstract terms, an encoding (such as UTF-8) is made of a decoder 
and an encoder. A decoder converts a stream of bytes into a stream of 
text (Unicode scalar values, ie. code points excluding surrogates), 
while an encoder does the reverse. This does not cover other kinds of 
stream transformation such as base64, compression, encryption, etc.


Bytes are represented in Rust by u8, text by str/char.

(Side note: Because of constraints imposed by JavaScript and to avoid 
costly conversions, Servo will probably use a different data type for 
representing text. This encoding API could eventually become generic 
over a Text trait, but I think that it should stick to str for now.)



The most convenient way to represent a stream is with a vector or 
string. This however requires the whole input to be in memory before 
decoding/encoding can start, and that to be finished before any of the 
output can be used. It should definitely be possible to eg. decode 
some content as it arrives from the network, and parse it in a pipeline.


The most fundamental type of API is one where the user repeatedly 
pushes chunks of input into a decoder/encoder object (that may 
maintain state between chunks) and gets the output so far in return, 
then signals the end of the input.


An iterator adapter, where the user pulls output from the decoder 
which pulls from the input, can be nicer, but is easy to build on top 
of a push-based API, while the reverse requires tasks.


Iterator<u8> and Iterator<char> are tempting, but we may need to work 
on big chunks at a time for efficiency: Iterator<~[u8]> and 
Iterator<~str>. Or could single-byte/char iterators be reliably 
inlined to achieve similar efficiency?



Finally, this API also needs to support several kinds of error 
handling. For example, a decoder should abort at the invalid byte 
sequence for XML, but insert U+FFFD (replacement character) for HTML. 
I’m not decided yet whether to just have the closed set of error 
handling modes defined in the spec, or make this open-ended with 
conditions.



Based on all the above, here is a proposed API. Encoders are omitted, 
but they are mostly the same as decoders with [u8] and str swapped.



/// Types implementing this trait are algorithms
/// such as UTF8, UTF-16, SingleByteEncoding, etc.
/// Values of these types are encodings as defined in the WHATWG spec:
/// "UTF-8", "UTF-16-LE", "Windows-1252", etc.
trait Encoding {
    /// Could become an associated type with a ::new() constructor
    /// when the language supports that.
    fn new_decoder(&self) -> ~Decoder;

    /// Simple, one shot API.
    /// Decode a single byte string that is entirely in memory.
    /// May raise the decoding_error condition.
    fn decode(&self, input: &[u8]) -> Result<~str, DecodeError> {
        // Implementation (using a Decoder) left out.
        // This is a default method, but not meant to be overridden.
    }
}

/// Takes the invalid byte sequence.
/// Return a replacement string, or None to abort with a DecodeError.
condition! {
    pub decoding_error : ~[u8] -> Option<~str>;
}

struct DecodeError {
    input_byte_offset: uint,
    invalid_byte_sequence: ~[u8],
}

/// Each implementation of Encoding has one corresponding implementation
/// of Decoder (and one of Encoder).
///
/// A new Decoder instance should be used for every input.
/// A Decoder instance should be discarded after DecodeError was returned.
trait Decoder {
    /// Call this repeatedly with a chunk of input bytes.
    /// As much as possible of the decoded text is appended to output.
    /// May raise the decoding_error condition.
    fn feed(input: &[u8], output: &mut ~str) -> Option<DecodeError>;

    /// Call this to indicate the end of the input.
    /// The Decoder instance should be discarded afterwards.
    /// Some encodings may append some final output at this point.
    /// May raise the decoding_error condition.
    fn flush(output: &mut ~str) -> Option<DecodeError>;

Re: [rust-dev] Proposed API for character encodings

2013-09-11 Thread Simon Sapin

On 11/09/2013 at 17:19, Marvin Löbel wrote:

On 09/10/2013 05:47 PM, Simon Sapin wrote:

Hi,

TL;DR: the actual proposal is at the end of this email.

Rust today has good support for UTF-8, which new content definitely
should use, but many systems still have to deal with legacy content
that uses other character encodings. There are several projects around
to implement more encodings in Rust. The furthest along, in my
opinion, is rust-encoding, notably because it implements the right
specification.

rust-encoding: https://github.com/lifthrasiir/rust-encoding

The spec: http://encoding.spec.whatwg.org/
It has more precise definitions of error handling than some original
RFCs, and better reflects the reality of legacy content on the web.

There was some discussion in the past few days about importing
rust-encoding (or part of it) into Rust’s libstd or libextra. Before
that, I think it is important to define a good API. The spec defines
one for JavaScript, but we should not copy that exactly.
rust-encoding’s API is mostly good, but I think that error handling
could be simplified.


In abstract terms, an encoding (such as UTF-8) is made of a decoder
and an encoder. A decoder converts a stream of bytes into a stream of
text (Unicode scalar values, ie. code points excluding surrogates),
while an encoder does the reverse. This does not cover other kinds of
stream transformation such as base64, compression, encryption, etc.

Bytes are represented in Rust by u8, text by str/char.

(Side note: Because of constraints imposed by JavaScript and to avoid
costly conversions, Servo will probably use a different data type for
representing text. This encoding API could eventually become generic
over a Text trait, but I think that it should stick to str for now.)


The most convenient way to represent a stream is with a vector or
string. This however requires the whole input to be in memory before
decoding/encoding can start, and that to be finished before any of the
output can be used. It should definitely be possible to eg. decode
some content as it arrives from the network, and parse it in a pipeline.

The most fundamental type API is one where the user repeatedly
pushes chunks of input into a decoder/encoders object (that may
maintain state between chunks) and gets the output so far in return,
then signals the end of the input.

In iterator adapter where the users pulls output from the decoder
which pulls from the input can be nicer, but is easy to build on top
of a push-based API, while the reverse requires tasks.

Iteratoru8 and Iteratorchar are tempting, but we may need to work
on big chucks at a time for efficiency: Iterator~[u8] and
Iterator~str. Or could single-byte/char iterators be reliably
inlined to achieve similar efficiency?


Finally, this API also needs to support several kinds of errors
handling. For example, a decoder should abort at the invalid byte
sequence for XML, but insert U+FFFD (replacement character) for HTML.
I’m not decided yet whether to just have the closed set of error
handling modes defined in the spec, or make this open-ended with
conditions.


Based on all the above, here is a proposed API. Encoders are ommited,
but they are mostly the same as decoders with [u8] and str swapped.


/// Types implementing this trait are algorithms
/// such as UTF8, UTF-16, SingleByteEncoding, etc.
/// Values of these types are encodings as defined in the WHATWG spec:
/// UTF-8, UTF-16-LE, Windows-1252, etc.
trait Encoding {
 /// Could become an associated type with a ::new() constructor
 /// when the language supports that.
 fn new_decoder(self) - ~Decoder;

 /// Simple, one shot API.
 /// Decode a single byte string that is entirely in memory.
 /// May raise the decoding_error condition.
 fn decode(self, input: [u8]) - Result~str, DecodeError {
 // Implementation (using a Decoder) left out.
 // This is a default method, but not meant to be overridden.
 }
}

/// Takes the invalid byte sequence.
/// Return a replacement string, or None to abort with a DecodeError.
condition! {
 pub decoding_error : ~[u8] - Option~str;
}

struct DecodeError {
 input_byte_offset: uint,
 invalid_byte_sequence: ~[u8],
}

/// Each implementation of Encoding has one corresponding implementation
/// of Decoder (and one of Encoder).
///
/// A new Decoder instance should be used for every input.
/// A Decoder instance should be discarded after DecodeError was
returned.
trait Decoder {
 /// Call this repeatedly with a chunck of input bytes.
 /// As much as possible of the decoded text is appended to output.
 /// May raise the decoding_error condition.
 fn feed(input: [u8], output: mut ~str) - OptionDecodeError;

 /// Call this to indicate the end of the input.
 /// The Decoder instance should be discarded afterwards.
 /// Some encodings may append some final output at this point.
 /// May raise the decoding_error condition.
 fn 

[rust-dev] Proposed API for character encodings

2013-09-10 Thread Simon Sapin

Hi,

TL;DR: the actual proposal is at the end of this email.

Rust today has good support for UTF-8 which new content definitely 
should use, but many systems still have to deal with legacy content that 
uses other character encodings. There are several projects around to 
implement more encodings in Rust. The furthest along, in my opinion, 
is rust-encoding, notably because it implements the right specification.


rust-encoding: https://github.com/lifthrasiir/rust-encoding

The spec: http://encoding.spec.whatwg.org/
It has more precise definitions of error handling than some original 
RFCs, and better reflects the reality of legacy content on the web.


There was some discussion in the past few days about importing 
rust-encoding (or part of it) into Rust’s libstd or libextra. Before 
that, I think it is important to define a good API. The spec defines one 
for JavaScript, but we should not copy that exactly. rust-encoding’s API 
is mostly good, but I think that error handling could be simplified.



In abstract terms, an encoding (such as UTF-8) is made of a decoder 
and an encoder. A decoder converts a stream of bytes into a stream of 
text (Unicode scalar values, ie. code points excluding surrogates), 
while an encoder does the reverse. This does not cover other kinds of 
stream transformation such as base64, compression, encryption, etc.


Bytes are represented in Rust by u8, text by str/char.

(Side note: Because of constraints imposed by JavaScript and to avoid 
costly conversions, Servo will probably use a different data type for 
representing text. This encoding API could eventually become generic 
over a Text trait, but I think that it should stick to str for now.)



The most convenient way to represent a stream is with a vector or 
string. This however requires the whole input to be in memory before 
decoding/encoding can start, and that to be finished before any of the 
output can be used. It should definitely be possible to eg. decode some 
content as it arrives from the network, and parse it in a pipeline.


The most fundamental kind of API is one where the user repeatedly pushes 
chunks of input into a decoder/encoder object (which may maintain state 
between chunks) and gets the output so far in return, then signals the 
end of the input.


An iterator adapter, where the user pulls output from the decoder, 
which in turn pulls from the input, can be nicer, but it is easy to build 
on top of a push-based API, while the reverse requires tasks.


Iterator<u8> and Iterator<char> are tempting, but we may need to work on 
big chunks at a time for efficiency: Iterator<~[u8]> and Iterator<~str>. 
Or could single-byte/char iterators be reliably inlined to achieve 
similar efficiency?



Finally, this API also needs to support several kinds of error 
handling. For example, a decoder should abort at the first invalid byte 
sequence for XML, but insert U+FFFD (the replacement character) for HTML. 
I’m not decided yet whether to have just the closed set of error 
handling modes defined in the spec, or to make this open-ended with conditions.
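The two modes can be contrasted with a toy decoder, sketched in present-day 
Rust. The decoder here is a hypothetical ASCII-only one (any byte >= 0x80 is 
invalid); the `ErrorMode` enum and function names are illustrative, not the 
proposed API.

```rust
// Illustration of the two error-handling modes with a toy ASCII decoder.
enum ErrorMode {
    Abort,   // XML-style: stop at the first invalid byte
    Replace, // HTML-style: substitute U+FFFD and continue
}

/// On abort, returns the byte offset of the first invalid byte.
fn decode_ascii(input: &[u8], mode: ErrorMode) -> Result<String, usize> {
    let mut out = String::new();
    for (offset, &b) in input.iter().enumerate() {
        if b < 0x80 {
            out.push(b as char);
        } else {
            match mode {
                ErrorMode::Abort => return Err(offset),
                ErrorMode::Replace => out.push('\u{FFFD}'),
            }
        }
    }
    Ok(out)
}

fn main() {
    let input = b"ok\xFF!";
    assert_eq!(decode_ascii(input, ErrorMode::Abort), Err(2));
    assert_eq!(
        decode_ascii(input, ErrorMode::Replace),
        Ok("ok\u{FFFD}!".to_string())
    );
}
```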



Based on all the above, here is a proposed API. Encoders are omitted, 
but they are mostly the same as decoders with &[u8] and &str swapped.



/// Types implementing this trait are algorithms
/// such as UTF8, UTF-16, SingleByteEncoding, etc.
/// Values of these types are encodings as defined in the WHATWG spec:
/// UTF-8, UTF-16-LE, Windows-1252, etc.
trait Encoding {
    /// Could become an associated type with a ::new() constructor
    /// when the language supports that.
    fn new_decoder(&self) -> ~Decoder;

    /// Simple, one-shot API.
    /// Decode a single byte string that is entirely in memory.
    /// May raise the decoding_error condition.
    fn decode(&self, input: &[u8]) -> Result<~str, DecodeError> {
        // Implementation (using a Decoder) left out.
        // This is a default method, but not meant to be overridden.
    }
}

/// Takes the invalid byte sequence.
/// Return a replacement string, or None to abort with a DecodeError.
condition! {
    pub decoding_error : ~[u8] -> Option<~str>;
}

struct DecodeError {
    input_byte_offset: uint,
    invalid_byte_sequence: ~[u8],
}

/// Each implementation of Encoding has one corresponding implementation
/// of Decoder (and one of Encoder).
///
/// A new Decoder instance should be used for every input.
/// A Decoder instance should be discarded after a DecodeError was returned.
trait Decoder {
    /// Call this repeatedly with a chunk of input bytes.
    /// As much as possible of the decoded text is appended to output.
    /// May raise the decoding_error condition.
    fn feed(input: &[u8], output: &mut ~str) -> Option<DecodeError>;

    /// Call this to indicate the end of the input.
    /// The Decoder instance should be discarded afterwards.
    /// Some encodings may append some final output at this point.
    /// May raise the decoding_error condition.
    fn flush(output: &mut ~str) -> Option<DecodeError>;
}
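The elided default decode method would simply drive a Decoder to completion. 
Here is one way it might look, sketched in present-day Rust with Result and 
String in place of the old condition system and ~str; the trait shape is 
simplified and all names are illustrative.

```rust
#[derive(Debug)]
struct DecodeError {
    input_byte_offset: usize,
    invalid_byte_sequence: Vec<u8>,
}

trait Decoder {
    fn feed(&mut self, input: &[u8], output: &mut String) -> Option<DecodeError>;
    fn flush(&mut self, output: &mut String) -> Option<DecodeError>;
}

/// One-shot decoding built on the push-based trait: feed everything,
/// then flush. A streaming caller would call feed once per chunk instead.
fn decode_all(decoder: &mut dyn Decoder, input: &[u8]) -> Result<String, DecodeError> {
    let mut output = String::new();
    if let Some(err) = decoder.feed(input, &mut output) {
        return Err(err);
    }
    if let Some(err) = decoder.flush(&mut output) {
        return Err(err);
    }
    Ok(output)
}

// A toy implementation (ASCII only, bytes >= 0x80 are invalid) to exercise it.
struct AsciiDecoder;

impl Decoder for AsciiDecoder {
    fn feed(&mut self, input: &[u8], output: &mut String) -> Option<DecodeError> {
        for (i, &b) in input.iter().enumerate() {
            if b >= 0x80 {
                return Some(DecodeError {
                    input_byte_offset: i,
                    invalid_byte_sequence: vec![b],
                });
            }
            output.push(b as char);
        }
        None
    }

    fn flush(&mut self, _output: &mut String) -> Option<DecodeError> {
        None // ASCII has no trailing state
    }
}

fn main() {
    assert_eq!(decode_all(&mut AsciiDecoder, b"hi").unwrap(), "hi");
    assert!(decode_all(&mut AsciiDecoder, b"\xFF").is_err());
}
```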

/// Pull-based API.
struct