Re: [whatwg] Requiring the Encoding Standard preferred name is too strict for no good reason
2013/8/1 Ian Hickson i...@hixie.ch:
> On Thu, 1 Aug 2013, Martin Janecke wrote:
>> I don't see any sense in making a document that is declared as ISO-8859-1 and encoded as ISO-8859-1 non-conforming. Just because the ISO-8859-1 code points are a subset of windows-1252? So is US-ASCII. Should a US-ASCII declaration also be non-conforming then -- even if the document only contains bytes from the US-ASCII range? What's the benefit? I assume this is supposed to be helpful in some way, but to me it just seems wrong and confusing.
>
> If you avoid the bytes that are different in ISO-8859-1 and Win1252, the spec now allows you to use either label. (As well as cp1252, cp819, ibm819, l1, latin1, x-cp1252, etc.)
>
> The part that I find problematic is that if you use byte 0x85 from Windows 1252 (U+2026 … HORIZONTAL ELLIPSIS), and then label the document as ansi_x3.4-1968, ascii, iso-8859-1, iso-ir-100, iso8859-1, iso_8859-1:1987, us-ascii, or a number of other options, it'll still be valid, and it'll work exactly as if you'd labeled it windows-1252. This despite the fact that in ASCII and in ISO-8859-1, byte 0x85 does not map to U+2026. It maps to U+0085 in 8859-1, and it is undefined in ASCII (since ASCII is a 7 bit encoding).

The ISO-8859-1 vs. Windows-1252 difference seems a minor issue, because 0x85 in ISO-8859-1 is Next Line. As far as I know, 0x85/U+0085 is used only on some IBM systems.

For Japanese encodings there is the Shift_JIS vs. Windows-31J issue, which has annoyed people for a long time. Windows-31J has many characters that are not included in Shift_JIS, and many of its Unicode mappings differ from those of Shift_JIS. But many existing Web pages declare Shift_JIS and use characters that exist only in Windows-31J.

Therefore, if people want to declare a document as true Shift_JIS, there is no way to do so in the existing framework. It would need a new mechanism, for example a new meta specifier like
  META i-want-to-truly-specify-charset-as=Shift_JIS
so that the browser recognizes the document's encoding as true Shift_JIS.

But such people should use UTF-8 instead of introducing a new mechanism like this.

-- NARUSE, Yui nar...@airemix.jp
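The byte 0x85 case discussed above is easy to check with any codec library. The following is only a sketch in Python (chosen here for illustration, not something used in the thread). Note that Python's `latin-1` and `ascii` codecs implement the nominal encodings, not the browser behavior in which those labels are treated as windows-1252.

```python
# Decode the single byte 0x85 under the three encodings discussed above.
# Python's codecs follow the nominal standards, not the browser label mapping.
data = b"\x85"

as_windows_1252 = data.decode("cp1252")   # HORIZONTAL ELLIPSIS
as_iso_8859_1 = data.decode("latin-1")    # C1 control character NEXT LINE

try:
    data.decode("ascii")
    ascii_result = "decoded"
except UnicodeDecodeError:
    # 0x85 is outside the 7-bit ASCII range, so strict decoding fails.
    ascii_result = "undefined"

print(f"windows-1252: U+{ord(as_windows_1252):04X}")
print(f"iso-8859-1:   U+{ord(as_iso_8859_1):04X}")
print(f"us-ascii:     {ascii_result}")
```

A browser following the spec text quoted above would give the windows-1252 result for all three labels.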
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
2012/3/22 Anne van Kesteren ann...@opera.com:
> As for the API, how about:
>
>   enc = new Encoder("euc-kr")
>   string1 = enc.encode(bytes1)
>   string2 = enc.encode(bytes2)
>   string3 = enc.eof() // might return empty string if all is fine
>
> And similarly you would have
>
>   dec = new Decoder("shift_jis")
>   bytes = dec.decode(string)
>
> Or alternatively you could have a single object that exposes both encode() and decode() and tracks state for both:
>
>   enc = new Encoding("gb18030")
>   bytes1 = enc.decode(string1)
>   string2 = enc.encode(bytes2)

Usually, strings are encoded to bytes. Therefore the encode/decode methods should be reversed, like this:

  enc = new Encoding("gb18030")
  bytes1 = enc.encode(string1)
  string2 = enc.decode(bytes2)

Or, if that may cause confusion, use getBytes/getChars as Java and C# do.
http://docs.oracle.com/javase/7/docs/api/java/lang/String.html
http://msdn.microsoft.com/en-us/library/system.text.encoder(v=vs.110).aspx#Y1873
http://msdn.microsoft.com/en-us/library/system.text.Decoder(v=vs.110).aspx#Y1873

  enc = new Encoding("gb18030")
  bytes1 = enc.getBytes(string1)
  string2 = enc.getChars(bytes2)

-- NARUSE, Yui nar...@airemix.jp
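As one illustration of the "strings are encoded to bytes" convention, Python's built-in types use the same direction. This is only a sketch of the naming convention in another language, not the API proposed in this thread; the sample text is an arbitrary assumption.

```python
# "encode" goes from text to bytes; "decode" goes from bytes back to text.
text = "中文"                          # arbitrary sample string
encoded = text.encode("gb18030")       # str -> bytes
decoded = encoded.decode("gb18030")    # bytes -> str

print(type(encoded).__name__, type(decoded).__name__)
print(decoded == text)
```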
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
2012/3/21 Glenn Maynard gl...@zewt.org:
> On Tue, Mar 20, 2012 at 12:39 PM, Joshua Bell jsb...@chromium.org wrote:
>> 1. Only support encodings with stateless coding (possibly down to a minimum of UTF-8)
>> 2. Only provide an API supporting non-streaming coding (i.e. whole strings/whole buffers)
>> 3. Expand the API to return encoder/decoder objects that capture state
>>
>> Any others?
>>
>> Trying to simplify the problem but take on both (1) and (2) without (3) would lead to an API that could not encompass (3) in the future, which would be a mistake.
>
> I don't think that's obviously a mistake. Only the nastiest, wartiest of legacy encodings require it.

These categories feel strange. If the conversion is not streaming (whole strings/whole buffers), its implementation should simply be a wrapper around the browser's conversion functions. There is no need for a state object, because the conversion is complete when the function returns, even for a stateful encoding.

Streaming conversion, on the other hand, needs state even if the encoding is stateless. When a chunk of input ends in the middle of a character, as in \xE3\x81\x82\xC2, the conversion consumes 4 bytes, outputs one character, \u3042, and remembers the partial byte \xC2. That byte is the state.

> That said, it's fairly simple to later return an additional state object from the previously proposed streaming APIs, eg.
>
>   result = decode(str, 0, outputView)
>   // result.outputBytes == 15
>   // result.nextInputByte == 5
>   // result.state == opaque object
>   result2 = decode(str, result.nextInputByte, outputView, {state: result.state});

You can refer to mbsrtowcs(3), which converts a character string to a wide-character string (restartable). It uses an opaque state object. Its length-limited variant is declared as:

  size_t mbsnrtowcs(wchar_t *restrict dst, const char **restrict src, size_t nmc, size_t len, mbstate_t *restrict ps);

http://pubs.opengroup.org/onlinepubs/9699919799/functions/mbsrtowcs.html

In any case, these APIs need to report an error if the byte sequence is invalid for the encoding.

-- NARUSE, Yui nar...@airemix.jp
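The \xE3\x81\x82\xC2 example above can be reproduced with an incremental decoder. The sketch below uses Python's standard `codecs` module purely as an illustration of a decoder object that holds the leftover bytes as its state; the second chunk (\xA9) and the helper `is_valid_utf8` are assumptions added for the example.

```python
import codecs

# A streaming decoder object: it keeps undecoded trailing bytes between calls.
decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")

# The chunk ends in the middle of a two-byte character.
first_chunk = decoder.decode(b"\xe3\x81\x82\xc2")

# The buffered partial input is the decoder's state.
pending_bytes, _flags = decoder.getstate()

# Supplying the rest of the character completes it.
second_chunk = decoder.decode(b"\xa9", final=True)


def is_valid_utf8(data: bytes) -> bool:
    """Non-streaming check: an incomplete or invalid sequence is an error."""
    try:
        codecs.getincrementaldecoder("utf-8")().decode(data, final=True)
        return True
    except UnicodeDecodeError:
        return False


print(repr(first_chunk), pending_bytes, repr(second_chunk))
```

With `final=True` the decoder treats a dangling partial character as an error, which matches the point that invalid byte sequences must be reported.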
Re: [whatwg] API for encoding/decoding ArrayBuffers into text
2012/3/21 Jonas Sicking jo...@sicking.cc:
> I'm pretty sure there is consensus for supporting UTF8. UTF8 is stateful though can be made not stateful by not consuming all characters and instead forcing the caller to keep the state (in the form of unconsumed text).

Your use of the word "stateful" rests on a misunderstanding. Usually a "stateful encoding" is one that keeps state between characters, not between bytes. What you mean is usually called "multibyte". UTF-8 is a multibyte encoding, and a decoder needs to keep state only when streaming.

> So I would rephrase your 3 options above as:
>
> 1) Create an API which forces consumers to do state handling. Probably leading to people creating wrappers which essentially implement option 3
> 2) Don't support streaming
> 3) Have encoder/decoder objects which hold state
>
> I personally don't think 1 is a good option since it's basically the same as 3 but just with libraries doing some of the work. We might as well do that work so that libraries aren't needed.
>
> This leaves us with 2 or 3. So the question is if we should support streaming or not. I suspect doing so would be worth it.

I think it should provide a non-streaming API. Then, if there are concrete use cases, provide a streaming API as a separate one.

-- NARUSE, Yui nar...@airemix.jp
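The distinction between stateful and multibyte can be seen by comparing ISO-2022-JP with UTF-8. The following is a small sketch using Python's standard codecs, chosen only for illustration; the sample bytes are assumptions for the example.

```python
# ISO-2022-JP is stateful: the same two bytes 0x24 0x22 mean different things
# depending on which escape sequence was seen before them.
in_ascii_state = b'$"'.decode("iso2022_jp")

# ESC $ B switches to JIS X 0208; ESC ( B switches back to ASCII.
in_jisx0208_state = b'\x1b$B$"\x1b(B'.decode("iso2022_jp")

# UTF-8 is multibyte but not stateful: each character's bytes are
# self-describing and do not depend on anything that came earlier.
utf8_bytes = "\u3042".encode("utf-8")

print(repr(in_ascii_state), repr(in_jisx0208_state), utf8_bytes)
```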
Re: [whatwg] Encodings and the web
(2012/01/08 23:32), Anne van Kesteren wrote:
> On Sun, 08 Jan 2012 01:37:14 +0100, NARUSE, Yui nar...@airemix.jp wrote:
>>> = Legacy multi-octet Chinese (traditional) encodings
>> Mozilla supports another Big5 variant, Big5-UAO.
>> http://bugs.ruby-lang.org/issues/1784
>
> As part of the big5 encoding, right? It sounds like it's a good idea to adopt that. I don't think there's much concern about table size these days, though obviously the less complexity the better.

CC to the original reporter. Could you help us with the current situation in Taiwan?

>>> == iso-2022-jp === The to Unicode algorithm Based on iso-2022-jp state = ASCII state == Based on octet: === Otherwise If the fatal flag is set, return failure. Otherwise, emit the fallback code point.
>> Just FYI, IE and Opera show these bytes as Katakana. If octet is greater than 0xA0 and less than 0xE0, value is octet + 0xFEC0. Moreover IE shows any shift_jis characters here. It seems that IE uses the same converter both iso-2022-jp and shift_jis.
>
> I have filed a bug on Opera to become more strict like Webkit/Gecko. If there is some evidence that approach is wrong though, we can turn it around.

There is an old variant of ISO-2022-JP called JIS8. JIS8 was in use before RFC1468 was written, and it is still used in some areas, for example bank-to-bank information exchange. The "8" in JIS8 refers to the 8-bit bytes used to express Katakana, which is exactly the behavior described above.

So I can't say at this time that it is a bug in Opera. It depends on how many sites use such 8-bit Katakana.

-- NARUSE, Yui nar...@airemix.jp
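The 8-bit Katakana rule quoted above (octet + 0xFEC0 for octets between 0xA0 and 0xE0, exclusive) can be written out directly. This is a minimal sketch; the function name `jis8_katakana` is a hypothetical helper, and the comparison with Python's `shift_jis` codec is only there to show that the single-octet Katakana range coincides with Shift_JIS.

```python
def jis8_katakana(octet: int) -> str:
    """Map a single 8-bit Katakana octet to its halfwidth Katakana code point."""
    if not 0xA0 < octet < 0xE0:
        raise ValueError(f"octet 0x{octet:02X} is outside the Katakana range")
    return chr(octet + 0xFEC0)


# 0xA1 maps to U+FF61 and 0xDF maps to U+FF9F, the ends of the halfwidth block.
for octet in (0xA1, 0xB1, 0xDF):
    print(f"0x{octet:02X} -> U+{ord(jis8_katakana(octet)):04X}")

# Shift_JIS uses the same single-octet range for halfwidth Katakana.
same_as_shift_jis = all(
    bytes([octet]).decode("shift_jis") == jis8_katakana(octet)
    for octet in range(0xA1, 0xE0)
)
print(same_as_shift_jis)
```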
Re: [whatwg] Encodings and the web
Hi, thank you for the quick reply.

(2012/01/09 0:38), Lin Jen-Shin (godfat) wrote:
> On Sun, Jan 8, 2012 at 11:20 PM, NARUSE, Yui nar...@airemix.jp wrote:
>> (2012/01/08 23:32), Anne van Kesteren wrote:
>>> On Sun, 08 Jan 2012 01:37:14 +0100, NARUSE, Yui nar...@airemix.jp wrote:
>>>>> = Legacy multi-octet Chinese (traditional) encodings
>>>> Mozilla supports another Big5 variants, Big5-UAO.
>>>> http://bugs.ruby-lang.org/issues/1784
>>>
>>> As part of the big5 encoding, right? It sounds like it's a good idea to adopt that. I don't think there's much concern about table size these days, though obviously the less complexity the better.
>>
>> CC to the original reporter. Could you cooperate about current situation in Taiwan?
>
> I am not sure what I can do here, but I would try my best to coordinate if there's anything I could help. So what are we trying to solve here, again?

This thread started at
http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2011-December/034241.html
and is discussing a spec for encodings on the web.
http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html

I'm interested in whether web browsers other than Mozilla should implement Big5-UAO or not.

Thanks,

-- NARUSE, Yui nar...@airemix.jp
Re: [whatwg] iso-2022-jp and octets over 0x7E
(2012/01/09 4:49), Anne van Kesteren wrote:
> On Sun, 08 Jan 2012 15:32:47 +0100, Anne van Kesteren ann...@opera.com wrote:
>> On Sun, 08 Jan 2012 01:37:14 +0100, NARUSE, Yui nar...@airemix.jp wrote:
>>>> == iso-2022-jp === The to Unicode algorithm Based on iso-2022-jp state = ASCII state == Based on octet: === Otherwise If the fatal flag is set, return failure. Otherwise, emit the fallback code point.
>>> Just FYI, IE and Opera show these bytes as Katakana. If octet is greater than 0xA0 and less than 0xE0, value is octet + 0xFEC0. Moreover IE shows any shift_jis characters here. It seems that IE uses the same converter both iso-2022-jp and shift_jis.
>>
>> I have filed a bug on Opera to become more strict like Webkit/Gecko. If there is some evidence that approach is wrong though, we can turn it around.
>
> So just to be sure I checked again and in Opera you can only get the special single-octet behavior if you activate a particular state first. If you are in ASCII, Opera will simply emit the octet unless it is 0x1B (ESC) so maybe there is a system font that does something special for those characters? Or maybe you meant something else?

Ah, you are correct. Opera's behavior is different from IE's, and it is clearly wrong.

-- NARUSE, Yui nar...@airemix.jp
Re: [whatwg] Encodings and the web
(2012/01/07 0:38), Anne van Kesteren wrote:
> On Thu, 22 Dec 2011 15:33:35 +0100, L. David Baron dba...@dbaron.org wrote:
>> This seems like one of those areas where it may be substantially easier to figure out what implementations do by looking at their code than by reverse-engineering, at least for the implementations whose code is available publicly.
>>
>> Gecko's code lives in http://mxr.mozilla.org/mozilla-central/source/intl/uconv/ . There are others who know it substantially better, but I or others could probably answer questions you have about how it works and how to understand it. I'm not the right person for pointers to other implementations, though.
>
> Thanks, I'm doing a combination of code inspection, reverse engineering (especially for edge cases), and applying some lessons we learned (e.g. non-greedy error handling).
>
> So far I defined the to Unicode algorithms for hz-gb-2312, euc-jp, iso-2022-jp, and shift_jis.

> = Legacy multi-octet Chinese (traditional) encodings

Mozilla supports another Big5 variant, Big5-UAO.
http://bugs.ruby-lang.org/issues/1784

> = Legacy multi-octet Japanese encodings
> The jis code point for a given number is: ...
> The jis0208 index for a given octet is:

I have doubts about this description, so let me explain the concept of JIS X 0208.

The most important point is that JIS X 0208 is defined in the context of ISO 2022. It is designed as an ISO/IEC 2022 double-byte 94-character set, which means its code space is 94 x 94.

http://en.wikipedia.org/wiki/JIS_X_0208
At the top of that page there are kuten numbers. "ku" is the row, expressed by the first byte of the double-byte code. "ten" is the cell, expressed by the second byte of the double-byte code. So a kuten number identifies a code point. Both ku and ten are integers from 1 to 94. For example, the kuten number of HIRAGANA LETTER A (U+3042) is 04-02 (04-01 is HIRAGANA LETTER SMALL A).

ISO-2022-JP, EUC-JP, and Shift_JIS each map a kuten number to bytes.

ISO-2022-JP's double bytes are:
  first: ku + 0x20
  second: ten + 0x20

EUC-JP's double bytes are:
  first: ku + 0xA0
  second: ten + 0xA0

Shift_JIS's double bytes are (with integer division):
  first:
    if 1 <= ku <= 62 then (ku-1) / 2 + 0x81
    elif 63 <= ku <= 94 then (ku-1) / 2 + 0xC1
  second:
    if ku is odd
      if 1 <= ten <= 63 then ten + 0x3F
      elif 64 <= ten <= 94 then ten + 0x40
    elif ku is even then ten + 0x9E

So theoretically, we should make a conversion table between kuten numbers and Unicode scalar values.

But as you know, JIS X 0208 in the web context should be Windows Code Page 932, as extended by Microsoft.
http://msdn.microsoft.com/en-us/goglobal/cc305152
That table is defined in terms of Shift_JIS.

> The jis0212 index for a given octet is:

As written in Bugzilla@Mozilla Bug 600715, IE doesn't support JIS X 0212.
https://bugzilla.mozilla.org/show_bug.cgi?id=600715
How to treat JIS X 0212 in this Encoding spec will be a problem.

> == iso-2022-jp === The to Unicode algorithm Based on iso-2022-jp state = ASCII state == Based on octet: === Otherwise If the fatal flag is set, return failure. Otherwise, emit the fallback code point.

Just FYI, IE and Opera show these bytes as Katakana: if the octet is greater than 0xA0 and less than 0xE0, the value is octet + 0xFEC0.

Moreover, IE shows any shift_jis characters here. It seems that IE uses the same converter for both iso-2022-jp and shift_jis.

-- NARUSE, Yui nar...@airemix.jp
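The kuten-to-bytes mappings can be sketched as three small functions. This is an illustrative sketch, not text from the spec under discussion: the function names are hypothetical, the Shift_JIS second byte follows the standard layout in which an odd ku selects the lower range of second bytes and an even ku the upper range, and the sample kuten 04-02 (HIRAGANA LETTER A) is an assumption chosen so the result can be compared with Python's own codecs.

```python
def _check(ku: int, ten: int) -> None:
    # Both ku (row) and ten (cell) run from 1 to 94 in a 94 x 94 code space.
    if not (1 <= ku <= 94 and 1 <= ten <= 94):
        raise ValueError(f"kuten {ku:02d}-{ten:02d} is out of range")


def kuten_to_iso2022jp(ku: int, ten: int) -> bytes:
    """Double byte used after the ESC $ B designation (escape not included)."""
    _check(ku, ten)
    return bytes([ku + 0x20, ten + 0x20])


def kuten_to_eucjp(ku: int, ten: int) -> bytes:
    _check(ku, ten)
    return bytes([ku + 0xA0, ten + 0xA0])


def kuten_to_shift_jis(ku: int, ten: int) -> bytes:
    _check(ku, ten)
    # Two rows share one lead byte; rows 63..94 jump to the 0xE0.. range.
    first = (ku - 1) // 2 + (0x81 if ku <= 62 else 0xC1)
    if ku % 2 == 1:
        # Odd row: lower half of the trail-byte range, skipping 0x7F.
        second = ten + (0x3F if ten <= 63 else 0x40)
    else:
        # Even row: upper half of the trail-byte range.
        second = ten + 0x9E
    return bytes([first, second])


ku, ten = 4, 2  # HIRAGANA LETTER A, U+3042
print(kuten_to_iso2022jp(ku, ten).hex())
print(kuten_to_eucjp(ku, ten).hex(), "\u3042".encode("euc_jp").hex())
print(kuten_to_shift_jis(ku, ten).hex(), "\u3042".encode("shift_jis").hex())
```

The last two lines print each computed byte pair next to the bytes produced by Python's `euc_jp` and `shift_jis` codecs, so any mismatch would be visible.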
Re: [whatwg] Default encoding to UTF-8?
> windows-1252 encoded and the override helps everyone. But it may also be the case that the data is in a different encoding and that the override therefore results in gibberish shown to the user, with no hint of the cause of the problem.

I don't think such a case exists. With character encoding overrides, a superset encoding overrides a standard one, so I can't imagine that case.

> It would therefore be better to signal a problem to the user, display the page using the windows-1252 encoding but with some instruction or hint on changing the encoding. And a browser should in this process really analyze whether the data can be windows-1252 encoded data that contains only characters permitted in HTML.

Such verification should be done by developer tools, not by production browsers, which are widely used by real users.

-- NARUSE, Yui nar...@airemix.jp
Re: [whatwg] Question about the application/x-www-form-urlencoded encoding algorithm
Hi,

(2010/01/21 16:29), NARUSE, Yui wrote:
> In 4.10.19.4 URL-encoded form data, The application/x-www-form-urlencoded encoding algorithm, it says:
>
>> For each character in the entry's name and value, apply the following subsubsteps:
>>
>> If the character isn't in the range U+0020, U+002A, U+002D, U+002E, U+0030 to U+0039, U+0041 to U+005A, U+005F, U+0061 to U+007A then replace the character with a string formed as follows: Start with the empty string, and then, taking each byte of the character when expressed in the selected character encoding in turn, append to the string a U+0025 PERCENT SIGN character (%) followed by two characters in the ranges U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9) and U+0041 LATIN CAPITAL LETTER A to U+0046 LATIN CAPITAL LETTER F representing the hexadecimal value of the byte (zero-padded if necessary).
>>
>> If the character is a U+0020 SPACE character, replace it with a single U+002B PLUS SIGN character (+).
>
> This means, U+9670, encoded as \x89\x41 in Shift_JIS, must be encoded as %89%41, and shouldn't be %89A?

As written, the spec says that \x89\x41 in Shift_JIS should be encoded as %89%41. But current implementations encode it as %89A. (I tested IE, Firefox, Opera, and Chrome.)

So this should be a bug in the spec.

-- NARUSE, Yui nar...@airemix.jp
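The difference between the two readings can be made concrete. The sketch below is an assumption-laden illustration, not any browser's actual code: `encode_per_character` follows the quoted spec text literally (every byte of a non-exempt character is percent-encoded), while `encode_per_byte` is a guess at a byte-wise model that reproduces the observed %89A output. Both function names are hypothetical.

```python
import string

# Characters the quoted algorithm leaves as-is (space is handled separately).
UNESCAPED = set(string.ascii_letters + string.digits + "*-._")


def encode_per_character(value: str, encoding: str) -> str:
    """Literal reading of the spec text: decide per character."""
    out = []
    for ch in value:
        if ch == " ":
            out.append("+")
        elif ch in UNESCAPED:
            out.append(ch)
        else:
            out.extend(f"%{b:02X}" for b in ch.encode(encoding))
    return "".join(out)


def encode_per_byte(value: str, encoding: str) -> str:
    """Assumed byte-wise model: encode to bytes first, then decide per byte."""
    out = []
    for b in value.encode(encoding):
        ch = chr(b)
        if ch == " ":
            out.append("+")
        elif ch in UNESCAPED:
            out.append(ch)
        else:
            out.append(f"%{b:02X}")
    return "".join(out)


print(encode_per_character("\u9670", "shift_jis"))
print(encode_per_byte("\u9670", "shift_jis"))
```

The two functions agree for ASCII-only input; they differ only when a trail byte of a multibyte character happens to fall in the unescaped set, as 0x41 does here.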
Re: [whatwg] [hybi] US-ASCII vs. ASCII in Web Socket Protocol
(2010/01/31 2:05), Julian Reschke wrote:
> Ian Hickson wrote:
>> On Fri, 4 Dec 2009, WeBMartians wrote:
>>> Hmmm... Maybe it would be better to say ISO-646US rather than ASCII. There is a lot of impreciseness about the very low value characters (less than 0x20 space) in the ASCII specifications. The same can be said about the higher end.
>>
>> Where the interpretation was normative, I've used the term ANSI_X3.4-1968 (US-ASCII) and referenced RFC1345.
>
> I think you just lost both readability and precision.
>
> Please keep saying ASCII or US-ASCII, and then have a reference to the ANSI or ISO spec that actually defines ASCII, such as
>
>   [ANSI.X3-4.1986] American National Standards Institute, Coded Character Set - 7-bit American Standard Code for Information Interchange, ANSI X3.4, 1986.
>
> (taken from the relatively recent RFC 5322).
>
> RFC 1345 is a non-maintained, historic informational RFC that's not really a good definition for ASCII. If you disagree, please name a single RFC that has been published in the last 20 years that uses RFC 1345 to reference ASCII (I just searched, and couldn't find any).

The use of US-ASCII and ASCII in draft-hixie-thewebsocketprotocol-54 is correct. Changing all of them to ASCII, or to ANSI_X3.4-1968, would not be correct.

In draft-hixie-thewebsocketprotocol-54, every occurrence of the term US-ASCII appears in the phrase "encoded as US-ASCII". There it is used as an encoding name, so the preferred MIME name, US-ASCII, is correct.

ASCII is used in:
* ASCII case-insensitive
* ASCII lowercase
* ASCII serialization
* ASCII followed by a character, like ASCII :, ASCII CR, or ASCII space
* If /code/, interpreted as ASCII, is 407
* upper-case ASCII letters
* Unicode to ASCII
* the IDNA ToASCII algorithm
* UseSTD3ASCIIRules flags

These appear to refer to so-called ASCII in general, not to the definitions in a particular ASCII spec. So the nickname ASCII is suitable for them.

Anyway, the latest definition of so-called ASCII is named ANSI INCITS 4-1986 (R2007).
http://webstore.ansi.org/RecordDetail.aspx?sku=ANSI+INCITS+4-1986+(R2007)
Its ISO version is ISO/IEC 646:1991 IRV.
http://www.iso.org/iso/catalogue_detail.htm?csnumber=4777

-- NARUSE, Yui nar...@airemix.jp
[whatwg] Question about the application/x-www-form-urlencoded encoding algorithm
In 4.10.19.4 URL-encoded form data, The application/x-www-form-urlencoded encoding algorithm, it says:

> For each character in the entry's name and value, apply the following subsubsteps:
>
> If the character isn't in the range U+0020, U+002A, U+002D, U+002E, U+0030 to U+0039, U+0041 to U+005A, U+005F, U+0061 to U+007A then replace the character with a string formed as follows: Start with the empty string, and then, taking each byte of the character when expressed in the selected character encoding in turn, append to the string a U+0025 PERCENT SIGN character (%) followed by two characters in the ranges U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9) and U+0041 LATIN CAPITAL LETTER A to U+0046 LATIN CAPITAL LETTER F representing the hexadecimal value of the byte (zero-padded if necessary).
>
> If the character is a U+0020 SPACE character, replace it with a single U+002B PLUS SIGN character (+).

This means, U+9670, encoded as \x89\x41 in Shift_JIS, must be encoded as %89%41, and shouldn't be %89A?

thanks,

-- NARUSE, Yui nar...@airemix.jp
Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]
Ian Hickson wrote:
>>> Authors should not use JIS-X-0208 (JIS_C6226-1983), JIS-X-0212 (JIS_X0212-1990), encodings based on ISO-2022, and encodings based on EBCDIC.
>> It is not clear what this means (e.g., the character set JIS_C6226-1983 in any encoding, or only when encoded alone according to RFC1345 as described above);
>
> This is talking about character encodings, not character sets. JIS_C6226-1983 is a registered character encoding in the IANA registry.

Yes, I can understand this, but...

> On Fri, 23 Oct 2009, NARUSE, Yui wrote:
>>> Authors should not use JIS-X-0208 (JIS_C6226-1983), JIS-X-0212 (JIS_X0212-1990), encodings based on ISO-2022, and encodings based on EBCDIC.
>> First, JIS-X-0208 and JIS-X-0212 are not in IANA Charsets, moreover those correct names as spec are JIS X 0208 and JIS X 0212.
>
> On Thu, 22 Oct 2009, Øistein E. Andersen wrote:
>> I am not sure what you mean; they are both listed at http://www.iana.org/assignments/character-sets:
>>
>> Name: JIS_C6226-1983 [RFC1345,KXS2]
>> MIBenum: 63
>> Source: ECMA registry
>> Alias: iso-ir-87
>> Alias: x0208
>> Alias: JIS_X0208-1983
>> Alias: csISO87JISX0208
>>
>> Name: JIS_X0212-1990 [RFC1345,KXS2]
>> MIBenum: 98
>> Source: ECMA registry
>> Alias: x0212
>> Alias: iso-ir-159
>> Alias: csISO159JISX02121990
>
> On Fri, 23 Oct 2009, NARUSE, Yui wrote:
>> Where is the word JIS-X-0208 ?
>> Where is the word JIS-X-0212 ?
>
> The exact string isn't there, that's why I included the preferred MIME names in brackets in the spec.

If it is talking about character encodings, why does it mainly use the names of character sets? The following seems better:

  Authors should not use JIS_C6226-1983, JIS_X0212-1990, encodings based on ISO-2022, and encodings based on EBCDIC.

> On Fri, 23 Oct 2009, NARUSE, Yui wrote:
>> Second, JIS_C6226-1983, JIS_X0212-1990, and EBCDICs are not ASCII compatible. So they are out of discouraged; mustn't use.
>
> You can use non-ASCII-compatible encodings (e.g. UTF-16).

I see.

-- NARUSE, Yui nar...@airemix.jp
Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]
Øistein E. Andersen wrote:
> Discouraged encodings: ‘4.2.5.5 Specifying the document's character encoding’ advises against certain encodings. (Incidentally, this advice probably deserves not to be ‘hidden’ in a section nominally reserved for character encoding *declaration* issues.) In particular:
>
>> Authors should not use JIS-X-0208 (JIS_C6226-1983), JIS-X-0212 (JIS_X0212-1990), encodings based on ISO-2022, and encodings based on EBCDIC.

First, JIS-X-0208 and JIS-X-0212 are not in IANA Charsets; moreover, the correct names of those specs are JIS X 0208 and JIS X 0212.

Second, JIS_C6226-1983, JIS_X0212-1990, and the EBCDIC encodings are not ASCII compatible. So they are beyond "discouraged": they must not be used.

Finally, it is not clear why the ISO 2022 series is discouraged.

Anyway, most of the charsets defined in RFC 1345 are not clearly specified. Conversion tables between them and Unicode are needed.

-- NARUSE, Yui nar...@airemix.jp
Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]
Øistein E. Andersen wrote:
> On 22 Oct 2009, at 17:15, NARUSE, Yui wrote:
>> First, JIS-X-0208 and JIS-X-0212 are not in IANA Charsets,
>
> I am not sure what you mean; they are both listed at http://www.iana.org/assignments/character-sets:
>
> Name: JIS_C6226-1983 [RFC1345,KXS2]
> MIBenum: 63
> Source: ECMA registry
> Alias: iso-ir-87
> Alias: x0208
> Alias: JIS_X0208-1983
> Alias: csISO87JISX0208

Where is the word JIS-X-0208 ?

> Name: JIS_X0212-1990 [RFC1345,KXS2]
> MIBenum: 98
> Source: ECMA registry
> Alias: x0212
> Alias: iso-ir-159
> Alias: csISO159JISX02121990

Where is the word JIS-X-0212 ?

>> moreover those correct names as spec are JIS X 0208 and JIS X 0212.
>
> Please excuse me for not always paying due attention to such details in e-mails. (Of course, the specifications should follow either IANA or the official standard as appropriate, depending on what it is referring to.)

That was not directed at you; the sentence is in the current HTML5 draft, 4.2.5.5. That is why I paid attention to it.

>> Anyway, most of charsets defined RFC 1345 are not clear. Conversion table between [those charsets and] Unicode is needed.
>
> Quite. Anne van Kesteren, I and several others are currently trying to document how browsers handle different encodings at http://wiki.whatwg.org/wiki/Web_Encodings, and defining mappings to Unicode is one of the goals. Your contribution would be much appreciated.

ICU has a large set of tables which is likely to cover many MS code pages. (Of course, this should be verified.)
http://bugs.icu-project.org/trac/browser/data/trunk/charset/data/ucm

And I have a CP51932 table made from the .NET Framework's converter.
http://nkf.sourceforge.jp/ucm/cp51932.ucm

-- NARUSE, Yui nar...@airemix.jp
Re: [whatwg] Web Address and its escape
Anne van Kesteren wrote:
> On Tue, 08 Sep 2009 21:40:22 +0200, NARUSE, Yui nar...@airemix.jp wrote:
>> First is about 4.10.16.4 URL-encoded form data.
>> http://www.whatwg.org/specs/web-apps/current-work/#application/x-www-form-urlencoded-encoding-algorithm
>> In this algorithm at 6.2.1, SP, *, -, ., 0 .. 9, A .. Z, _, a .. z is not escaped.
>> But many other specs which use application/x-www-form-urlencoded refers
>
> Which other specifications?

The following specifications. (Sorry, some of them refer to earlier RFCs.)

XForms 1.0
http://www.w3.org/TR/xforms/#serialize-urlencode
  then non-ASCII and reserved characters (as defined by [RFC 2396] as amended by subsequent documents in the IETF track) are escaped
  (that is, RFC3986)

HTML 4
http://www.w3.org/TR/html401/interact/forms.html#h-17.13.4.1
  reserved characters are escaped as described in [RFC1738]

RFC1738
http://www.faqs.org/rfcs/rfc1738.html
  unreserved = alpha | digit | safe | extra
  safe = $ | - | _ | . | +
  extra = ! | * | ' | ( | ) | ,

TAG Finding
  refer to section 2.1 of [RFC2396].
http://www.w3.org/2001/tag/doc/whenToUseGet.html#i18n

RFC2396
http://www.faqs.org/rfcs/rfc2396.html
  unreserved = alphanum | mark
  mark = - | _ | . | ! | ~ | * | ' | ( | )

WSDL 2.0
http://www.w3.org/TR/wsdl20-bindings/#_http_x-www-form-urlencoded
  Replacement values falling outside the range (ALPHA and DIGIT below are defined as per [IETF RFC 4234]): ALPHA | DIGIT | - | . | _ | ~ | ! | $ | & | ' | ( | ) | * | + | , | ; | = | : | @, MUST be percent-encoded.

>> URI's unreserved. And it in RFC3986 is
>> unreserved= ALPHA / DIGIT / - / . / _ / ~
>> Why ~ is escaped and * is not escaped?
>
> What do browsers do?

IE8
QUERY_STRING: t=+%21%5c%22%5c%23%24%25%26%27%28%29*%2b%2c-.%2f0123456789%3a%3b%3c%3d%3e...@abcdefghijklmnopqrstuvwxyz%5b%5c%5c%5d%5e_%60abcdefghijklmnopqrstuvwxyz%7b%7c%7d%7e
not escaped: *...@_

Firefox 3.5
QUERY_STRING: t=+%21%5C%22%5C%23%24%25%26%27%28%29*%2B%2C-.%2F0123456789%3A%3B%3C%3D%3E%3F%40ABCDEFGHIJKLMNOPQRSTUVWXYZ%5B%5C%5C%5D%5E_%60abcdefghijklmnopqrstuvwxyz%7B%7C%7D%7E
not escaped: *-._

Chrome2
QUERY_STRING: t=+%21%5C%22%5C%23%24%25%26%27%28%29*%2B%2C-.%2F0123456789%3A%3B%3C%3D%3E%3F%40ABCDEFGHIJKLMNOPQRSTUVWXYZ%5B%5C%5C%5D%5E_%60abcdefghijklmnopqrstuvwxyz%7B%7C%7D%7E
not escaped: *-._

Opera9
QUERY_STRING: t=+%21%5C%22%5C%23%24%25%26%27%28%29%2A%2B%2C-.%2F0123456789%3A%3B%3C%3D%3E%3F%40ABCDEFGHIJKLMNOPQRSTUVWXYZ%5B%5C%5C%5D%5E_%60abcdefghijklmnopqrstuvwxyz%7B%7C%7D%7E
not escaped: -._

Hmm, Firefox and Chrome follow the spec, IE additionally leaves @ unescaped, and Opera escapes *. If this spec wants to err on the safe side, * could also be escaped.

>> Third is about Web addresses in HTML 5. (this spec is also this ML?)
>> http://www.w3.org/html/wg/href/draft
>
> You want public-...@w3.org or public-h...@w3.org for that draft.

Thanks, I'll send it there.

-- NARUSE, Yui nar...@airemix.jp
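The browser differences above come down to which punctuation characters are left unescaped. The sketch below models that with a hypothetical `form_encode` helper; the per-browser sets are taken from the summary in this message (Firefox and Chrome match the draft, IE also leaves @ alone, Opera escapes *), and the use of UTF-8 and uppercase hex digits is an assumption of the example.

```python
import string

ALNUM = string.ascii_letters + string.digits

# Punctuation left unescaped, per the summary in the message above.
HTML5_DRAFT = "*-._"     # also Firefox 3.5 and Chrome 2
IE8 = "*-._@"
OPERA9 = "-._"


def form_encode(value: str, unescaped: str) -> str:
    """Percent-encode value, keeping alphanumerics and the given punctuation."""
    keep = set(ALNUM + unescaped)
    out = []
    for byte in value.encode("utf-8"):
        ch = chr(byte)
        if ch == " ":
            out.append("+")
        elif ch in keep:
            out.append(ch)
        else:
            out.append(f"%{byte:02X}")
    return "".join(out)


sample = "~*@ "
for name, punctuation in (("draft", HTML5_DRAFT), ("IE8", IE8), ("Opera9", OPERA9)):
    print(name, form_encode(sample, punctuation))
```

Note that ~ is escaped under all three sets, even though RFC3986 lists it as unreserved, which is the inconsistency the original question points at.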
[whatwg] Web Address and its escape
Hi,

I have some comments and questions about urlencode and Web Address.

First is about 4.10.16.4 URL-encoded form data.
http://www.whatwg.org/specs/web-apps/current-work/#application/x-www-form-urlencoded-encoding-algorithm
In this algorithm at 6.2.1, SP, *, -, ., 0 .. 9, A .. Z, _, a .. z are not escaped. But many other specs which use application/x-www-form-urlencoded refer to URI's unreserved set, which in RFC3986 is
  unreserved = ALPHA / DIGIT / - / . / _ / ~
Why is ~ escaped while * is not escaped?

Second is also about URL-encoded form data 6.2.1. This says:
  the string a U+0025 PERCENT SIGN character (%) followed by two characters in the ranges U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9) and U+0041 LATIN CAPITAL LETTER A to U+005A LATIN CAPITAL LETTER Z
But hexadecimal digits are 0-9 and A-F, so "to U+0046 LATIN CAPITAL LETTER F" seems right.

Third is about Web addresses in HTML 5. (Is this spec also discussed on this ML?)
http://www.w3.org/html/wg/href/draft
In "2 Parsing Web addresses", step 2, "Percent-encode all non-URI characters in w", percent-encodes many characters, including U+0025 PERCENT SIGN. But under this spec, if a Web address w is an already escaped URL, this process double-escapes those characters.

For example, if w is http://www.example.org/D%C3%BCrst, then in step 2 w becomes http://www.example.org/D%25C3%25BCrst. And in step 5, w is broken.

Regards.

-- NARUSE, Yui nar...@airemix.jp
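The double-escaping problem in the third point can be reproduced with a generic percent-encoder. The sketch below uses `urllib.parse.quote` from Python's standard library as a stand-in for the draft's step 2; it is not that algorithm itself, and the choice of `safe=":/"` is an assumption made so that only the path content is affected.

```python
from urllib.parse import quote, unquote

already_escaped = "http://www.example.org/D%C3%BCrst"

# Percent-encoding again treats "%" as an ordinary character to escape.
escaped_again = quote(already_escaped, safe=":/")

# One round of unescaping then yields the escaped text, not the original name.
after_one_unescape = unquote(escaped_again)
intended = unquote(already_escaped)

print(escaped_again)
print(after_one_unescape)
print(intended)
```

A receiver that unescapes once sees the literal text D%C3%BCrst rather than the intended path segment, which is the breakage described above.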