Re: [whatwg] Encoding Standard (mostly complete)
"This is a decoder error" seems odd; it's descriptive language ("this thing must be made true") rather than declarative ("do this thing"). I'd suggest the declarative language "Emit a decoder error" and "Emit an encoder error".

"If code point is equal or greater than lower boundary" is more naturally "greater than or equal to" (and "less than or equal to"). That said, this would be much clearer with interval syntax:

If code point is in the range [*lower boundary*, 0x10FFFF] and is not in the range [0xD800, 0xDFFF], emit code point (and continue).

which I think is easier to read, and also makes it clear that 0xD800 to 0xDFFF is a closed interval (0xD800 and 0xDFFF are included).

"An encoder contains one or more encoder error points. Unless stated otherwise the encoder is terminated at that point."

Encoding form data, at least, doesn't abort on the first error; any unrepresentable code points are encoded as &#x1234;. (It would sure be nice if encoding to non-Unicode-based encodings would just *always* use that syntax for non-ASCII, so the encoders could be dropped, but I guess that would trigger bugs in pages that are currently masked...)

Is there any encoding path in browsers that does give up on the first error?

-- Glenn Maynard
Re: [whatwg] Encoding Standard (mostly complete)
On Tue, 17 Apr 2012 14:01:36 +0200, Julian Reschke julian.resc...@gmx.de wrote:
> As a nit, I believe that "Character Encoding" would make a better title than just "Encoding".

I was thinking maybe "Text Encoding" given how we're using that for the API, but I like single-word specs so I'm quite reluctant to change it.

-- Anne van Kesteren http://annevankesteren.nl/
Re: [whatwg] Encoding Standard (mostly complete)
On Wed, 18 Apr 2012 15:40:33 +0200, Glenn Maynard gl...@zewt.org wrote:
> "This is a decoder error" seems odd; it's descriptive language ("this thing must be made true") rather than declarative ("do this thing"). I'd suggest the declarative language "Emit a decoder error" and "Emit an encoder error".

Yes. Awesome suggestion implemented.

> "If code point is equal or greater than lower boundary" is more naturally "greater than or equal to" (and "less than or equal to"). That said, this would be much clearer with interval syntax: If code point is in the range [*lower boundary*, 0x10FFFF] and is not in the range [0xD800, 0xDFFF], emit code point (and continue). which I think is easier to read, and also makes it clear that 0xD800 to 0xDFFF is a closed interval (0xD800 and 0xDFFF are included).

Then we'd first have to introduce interval syntax to the English language. We could do that I suppose in the Terminology section if you think it would be better.

> "An encoder contains one or more encoder error points. Unless stated otherwise the encoder is terminated at that point." Encoding form data, at least, doesn't abort on the first error; any unrepresentable code points are encoded as &#x1234;. (It would sure be nice if encoding to non-Unicode-based encodings would just *always* use that syntax for non-ASCII, so the encoders could be dropped, but I guess that would trigger bugs in pages that are currently masked...) Is there any encoding path in browsers that does give up on the first error?

It has been proposed for the API. And in URLs you do not get &#...; (though in WebKit you do) but you get ? (IE at the network layer, Opera earlier on) or the utf-8 representation (Gecko is totally weird). Maybe we should align URLs with form here and use &#...; throughout if that is compatible with content. Probably deserves a discussion in its own thread. I do not know any cases beyond URLs, form, and the proposed API that require an encoder in the platform.

-- Anne van Kesteren http://annevankesteren.nl/
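For illustration only: the three encoder behaviours mentioned in this message (giving up at the first unrepresentable code point, substituting a numeric character reference, substituting "?") have rough analogues in the error handlers of Python's built-in codecs. The sketch below is an assumption-laden stand-in, not what any browser or the draft does; the function names and the choice of ISO-8859-1 are made up for the example, and Python's xmlcharrefreplace handler writes the reference in decimal.

```python
# Sample input: U+00E9 fits in ISO-8859-1, U+1234 does not.
text = "caf\u00e9 \u1234"


def encode_strict(s, encoding="iso-8859-1"):
    """Terminate on the first unrepresentable code point (returns None)."""
    try:
        return s.encode(encoding, errors="strict")
    except UnicodeEncodeError:
        return None


def encode_with_references(s, encoding="iso-8859-1"):
    """Replace unrepresentable code points with decimal character references."""
    return s.encode(encoding, errors="xmlcharrefreplace")


def encode_with_question_marks(s, encoding="iso-8859-1"):
    """Replace unrepresentable code points with '?'."""
    return s.encode(encoding, errors="replace")


if __name__ == "__main__":
    print(encode_strict(text))
    print(encode_with_references(text))
    print(encode_with_question_marks(text))
```

The second variant never fails for any input, which is the property that makes the form-submission behaviour attractive in the discussion above.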
Re: [whatwg] Encoding Standard (mostly complete)
On Wed, Apr 18, 2012 at 12:12 PM, Anne van Kesteren ann...@opera.com wrote:
>> "If code point is equal or greater than lower boundary" is more naturally "greater than or equal to" (and "less than or equal to"). That said, this would be much clearer with interval syntax: If code point is in the range [*lower boundary*, 0x10FFFF] and is not in the range [0xD800, 0xDFFF], emit code point (and continue). which I think is easier to read, and also makes it clear that 0xD800 to 0xDFFF is a closed interval (0xD800 and 0xDFFF are included).
> Then we'd first have to introduce interval syntax to the English language. We could do that I suppose in the Terminology section if you think it would be better.

It would also apply to http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html#index-gb18030-code-point, and it could apply to select ranges (e.g. 7.1 step 5: [0,0x7f]). Maybe it's not enough to be worth figuring out how to define it.

>> Encoding form data, at least, doesn't abort on the first error; any unrepresentable code points are encoded as &#x1234;. (It would sure be nice if encoding to non-Unicode-based encodings would just *always* use that syntax for non-ASCII, so the encoders could be dropped, but I guess that would trigger bugs in pages that are currently masked...) Is there any encoding path in browsers that does give up on the first error?
> It has been proposed for the API. And in URLs you do not get &#...; (though in WebKit you do) but you get ? (IE at the network layer, Opera earlier on) or the utf-8 representation (Gecko is totally weird).

I was testing with POST, which (at least in Gecko) uses HTML escapes for unrepresentable characters. (It would be pretty neat if that could be changed to *always* using HTML escapes for non-ASCII, except when encoding to UTF-8, since that's not introducing anything new--you can already receive &#x1234; escapes in POST data--and it would alleviate the "form submit encoding depends on the source page's encoding" problem. I guess this must break pages somehow, or vendors would have done this long ago.)

-- Glenn Maynard
Re: [whatwg] Encoding Standard (mostly complete)
On 19/04/2012 07:34, Glenn Maynard wrote:
> On Wed, Apr 18, 2012 at 12:12 PM, Anne van Kesteren ann...@opera.com wrote:
>> Then we'd first have to introduce interval syntax to the English language. We could do that I suppose in the Terminology section if you think it would be better.
> It would also apply to http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html#index-gb18030-code-point, and it could apply to select ranges (e.g. 7.1 step 5: [0,0x7f]). Maybe it's not enough to be worth figuring out how to define it.

All it takes is a couple of short sentences.

*Numeric Intervals.* Closed intervals are denoted with square brackets and open intervals with round brackets. For example, [0, 10) denotes the values from zero to ten, including zero but not including ten.

I agree with Glenn that using intervals would be clearer as well as being shorter.

Regards
-Mark
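As a rough sketch of how the interval wording reads once written out, here is the closed-interval test from earlier in the thread in Python. The helper names in_closed and is_emittable are hypothetical, and the upper bound 0x10FFFF (the last Unicode code point) is an assumption about what the draft intends. The half-open interval [0, 10) from the example above corresponds to Python's range(0, 10).

```python
def in_closed(value, low, high):
    """Return True if value lies in the closed interval [low, high]."""
    return low <= value <= high


def is_emittable(code_point, lower_boundary):
    """Code point is in [lower_boundary, 0x10FFFF] and not in [0xD800, 0xDFFF]."""
    return (in_closed(code_point, lower_boundary, 0x10FFFF)
            and not in_closed(code_point, 0xD800, 0xDFFF))


if __name__ == "__main__":
    # Both endpoints of the surrogate range are excluded from emission.
    print(is_emittable(0xD800, 0x800), is_emittable(0xDFFF, 0x800))
    # The code point immediately after the surrogate range is accepted.
    print(is_emittable(0xE000, 0x800))
    # Half-open interval [0, 10): zero is included, ten is not.
    print(0 in range(0, 10), 10 in range(0, 10))
```

Writing both conditions with the same helper makes it explicit that the endpoints 0xD800 and 0xDFFF are themselves rejected, which is the ambiguity the "0xD800 to 0xDFFF" wording leaves open.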
Re: [whatwg] Encoding Standard (mostly complete)
I find it confusing that the steps incrementing the byte and code point pointers come before the current byte or code point is processed (except for the EOF check), but a way to make it less confusing is not obvious.

Regards
-Mark

On 17/04/2012 18:30, Anne van Kesteren wrote:
> Hi, Apart from big5 (which requires some more research) all encoders and decoders are now defined: ...