Re: [whatwg] Requiring the Encoding Standard preferred name is too strict for no good reason

2013-08-03 Thread NARUSE, Yui
2013/8/1 Ian Hickson i...@hixie.ch:
 On Thu, 1 Aug 2013, Martin Janecke wrote:

 I don't see any sense in making a document that is declared as
 ISO-8859-1 and encoded as ISO-8859-1 non-conforming. Just because the
 ISO-8859-1 code points are a subset of windows-1252? So is US-ASCII.
 Should an US-ASCII declaration also be non-conforming then -- even if
 the document only contains bytes from the US-ASCII range? What's the
 benefit?

 I assume this is supposed to be helpful in some way, but to me it just
 seems wrong and confusing.

 If you avoid the bytes that are different in ISO-8859-1 and Win1252, the
 spec now allows you to use either label. (As well as cp1252, cp819,
 ibm819, l1, latin1, x-cp1252, etc.)

 The part that I find problematic is that if you use byte 0x85 from
 Windows 1252 (U+2026 … HORIZONTAL ELLIPSIS), and then label the document
 as ansi_x3.4-1968, ascii, iso-8859-1, iso-ir-100, iso8859-1,
 iso_8859-1:1987, us-ascii, or a number of other options, it'll still
 be valid, and it'll work exactly as if you'd labeled it windows-1252.
 This despite the fact that in ASCII and in ISO-8859-1, byte 0x85 does not
 map to U+2026. It maps to U+0085 in 8859-1, and it is undefined in ASCII
 (since ASCII is a 7 bit encoding).

The ISO-8859-1 vs. Windows-1252 issue sounds like a minor one, because 0x85
is Next Line (NEL) in ISO-8859-1.
As far as I know, 0x85/U+0085 is used only on some IBM systems.
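
A minimal sketch of the byte under discussion. It uses Python's built-in latin-1 and cp1252 codecs as stand-ins for the two labels; that choice is an assumption for illustration only and is not the decoder of any browser mentioned in this thread.

```python
# Decode the single byte 0x85 under the two labels discussed above.
data = b"\x85"

as_latin1 = data.decode("latin-1")  # strict ISO-8859-1 mapping
as_cp1252 = data.decode("cp1252")   # windows-1252 mapping

print(f"ISO-8859-1:   U+{ord(as_latin1):04X}")
print(f"windows-1252: U+{ord(as_cp1252):04X}")
```

The first mapping is the C1 control Next Line, the second is HORIZONTAL ELLIPSIS, which is the mismatch the quoted text describes.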

For Japanese encodings, there is the Shift_JIS vs. Windows-31J issue, which
has annoyed people for a long time.
Windows-31J has many additional characters that are not included in Shift_JIS,
and a number of its Unicode mappings differ from those of Shift_JIS.
But many existing Web pages declare Shift_JIS and use characters
that exist only in Windows-31J.
Therefore, if people want to declare a document as truly Shift_JIS,
there is no way to do so in the existing framework.
It would need a new mechanism, for example a new meta specifier like META
i-want-to-truly-specify-charset-as=Shift_JIS
so that browsers recognize the document's encoding as true Shift_JIS.

But such people should use UTF-8 instead of introducing such a new mechanism.
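
A small sketch of the two kinds of difference described above, assuming Python's shift_jis codec as a stand-in for strict Shift_JIS (JIS X 0208 only) and its cp932 codec as a stand-in for Windows-31J. The two sample byte pairs are chosen for illustration and do not come from the thread.

```python
# 0x81 0x60 exists in both encodings but maps to different code points.
# 0x87 0x40 (CIRCLED DIGIT ONE) exists only in the Windows-31J extension.
wave = b"\x81\x60"
circled_one = b"\x87\x40"

print("shift_jis:", hex(ord(wave.decode("shift_jis"))))
print("cp932:    ", hex(ord(wave.decode("cp932"))))
print("cp932 only:", hex(ord(circled_one.decode("cp932"))))

# The strict codec rejects the extension character.
try:
    circled_one.decode("shift_jis")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print("decodable as strict Shift_JIS:", strict_ok)
```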

-- 
NARUSE, Yui  nar...@airemix.jp


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-22 Thread NARUSE, Yui
2012/3/22 Anne van Kesteren ann...@opera.com:
 As for the API, how about:

  enc = new Encoder("euc-kr")
  string1 = enc.encode(bytes1)
  string2 = enc.encode(bytes2)
  string3 = enc.eof() // might return empty string if all is fine

 And similarly you would have

  dec = new Decoder("shift_jis")
  bytes = dec.decode(string)

 Or alternatively you could have a single object that exposes both encode()
 and decode() and tracks state for both:

  enc = new Encoding("gb18030")
  bytes1  = enc.decode(string1)
  string2 = enc.encode(bytes2)

Usually, strings are encoded into bytes, and bytes are decoded into strings.
Therefore those encode/decode methods should be reversed, like:

 enc = new Encoding("gb18030")
 bytes1  = enc.encode(string1)
 string2 = enc.decode(bytes2)

Or, if that may cause confusion, use getBytes/getChars like Java and C#.
http://docs.oracle.com/javase/7/docs/api/java/lang/String.html
http://msdn.microsoft.com/en-us/library/system.text.encoder(v=vs.110).aspx#Y1873
http://msdn.microsoft.com/en-us/library/system.text.Decoder(v=vs.110).aspx#Y1873

 enc = new Encoding("gb18030")
 bytes1  = enc.getBytes(string1)
 string2 = enc.getChars(bytes2)
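
For comparison, a short sketch of the same naming convention in Python, where str.encode produces bytes and bytes.decode produces a string. This only illustrates the direction of the two verbs; it is not the API being proposed in this thread.

```python
text = "\u4e2d\u6587"                 # two CJK characters

encoded = text.encode("gb18030")      # string -> bytes
decoded = encoded.decode("gb18030")   # bytes -> string

print(type(encoded).__name__, encoded.hex())
print(type(decoded).__name__, decoded == text)
```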

-- 
NARUSE, Yui  nar...@airemix.jp


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-21 Thread NARUSE, Yui
2012/3/21 Glenn Maynard gl...@zewt.org:
 On Tue, Mar 20, 2012 at 12:39 PM, Joshua Bell jsb...@chromium.org wrote:

 1. Only support encodings with stateless coding (possibly down to a minimum
 of UTF-8)
 2. Only provide an API supporting non-streaming coding (i.e. whole
 strings/whole buffers)
 3. Expand the API to return encoder/decoder objects that capture state

 Any others?

 Trying to simplify the problem but take on both (1) and (2) without (3)
 would lead to an API that could not encompass (3) in the future, which
 would be a mistake.

 I don't think that's obviously a mistake.  Only the nastiest, wartiest of
 legacy encodings require it.

These categories feel strange.

If the conversion is not streaming (whole strings/whole buffers), its
implementation should simply be a wrapper around the browser's
conversion functions.
There is no need for a state object to save the state, because the conversion
is complete when the function returns, even for a stateful encoding.

Streaming conversion needs state even if the encoding is stateless.
When the given partial input ends in the middle of a character,
like \xE3\x81\x82\xC2, the conversion consumes 4 bytes, outputs one character
\u3042, and remembers the partial byte \xC2. This byte is the state.
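
The case above can be sketched with Python's incremental decoder, which is used here only as an existing example of a decoder object that carries pending bytes between calls. The follow-up byte \xA9 is a made-up continuation chosen to complete the character.

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-8")()

# First chunk: one complete character plus the lead byte of the next one.
first = decoder.decode(b"\xe3\x81\x82\xc2")
pending, _flags = decoder.getstate()   # bytes held back by the decoder

# Second chunk supplies the missing continuation byte.
second = decoder.decode(b"\xa9", final=True)

print(repr(first), pending, repr(second))
```

All four bytes of the first chunk are consumed, one character is emitted, and the dangling \xC2 is what the decoder has to remember.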

 That said, it's fairly simple to later return an additional state object
 from the previously proposed streaming APIs, eg.

 result = decode(str, 0, outputView)
 // result.outputBytes == 15
 // result.nextInputByte == 5
 // result.state == opaque object

 result2 = decode(str, result.nextInputByte, outputView, {state:
 result.state});

You can refer to mbsrtowcs(3) and mbsnrtowcs(3), which convert a character
string to a wide-character string (restartable). They use an opaque state.
size_t mbsnrtowcs(wchar_t *restrict dst, const char **restrict src,
   size_t nmc, size_t len, mbstate_t *restrict ps);
http://pubs.opengroup.org/onlinepubs/9699919799/functions/mbsrtowcs.html

Anyway, they need to report an error if the byte sequence is invalid for the encoding.
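
A hedged sketch of that error requirement, again using Python's incremental decoder in place of a restartable C conversion. The helper name decode_all is hypothetical. With strict error handling, both an invalid byte and input that ends in the middle of a character raise an exception.

```python
import codecs

def decode_all(chunks, encoding="utf-8"):
    """Decode an iterable of byte chunks, failing on invalid input."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    parts = [decoder.decode(chunk) for chunk in chunks]
    # final=True makes leftover partial bytes an error instead of state.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)

def fails(chunks):
    try:
        decode_all(chunks)
    except UnicodeDecodeError:
        return True
    return False

print(decode_all([b"\xe3\x81", b"\x82"]))          # character split across chunks
print("invalid byte rejected:", fails([b"\xff"]))
print("truncated input rejected:", fails([b"\xe3\x81"]))
```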

-- 
NARUSE, Yui  nar...@airemix.jp


Re: [whatwg] API for encoding/decoding ArrayBuffers into text

2012-03-21 Thread NARUSE, Yui
2012/3/21 Jonas Sicking jo...@sicking.cc:
 I'm pretty sure there is consensus for supporting UTF8. UTF8 is
 stateful though can be made not stateful by not consuming all
 characters and instead forcing the caller to keep the state (in the
 form of unconsumed text).

Your use of the word stateful involves a misunderstanding.
Usually, a stateful encoding means an encoding that keeps state
between characters, not between bytes.
What you mean is usually expressed by the word multibyte.
UTF-8 is a multibyte encoding, and it needs to keep state when streaming.
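
To illustrate the distinction, here is a small sketch using Python's iso2022_jp codec as an example of a stateful encoding (ISO-2022-JP is not named in this message; it is picked here as an assumption). The same two bytes mean different things depending on the escape sequence that came before them, whereas a UTF-8 byte sequence for a character is the same in any context.

```python
# In ISO-2022-JP, escape sequences switch the character set between characters.
two_bytes = b'$"'

in_ascii_state = two_bytes.decode("iso2022_jp")
in_jis_state = (b"\x1b$B" + two_bytes + b"\x1b(B").decode("iso2022_jp")

print(repr(in_ascii_state))   # read as two ASCII characters
print(repr(in_jis_state))     # read as one JIS X 0208 character

# UTF-8 is multibyte but stateless between characters.
mixed = "a\u3042b"
utf8 = mixed.encode("utf-8")
print(utf8.hex(), "\u3042".encode("utf-8").hex())
```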

 So I would rephrase your 3 options above as:

 1) Create an API which forces consumers to do state handling. Probably
 leading to people creating wrappers which essentially implement option
 3
 2) Don't support streaming
 3) Have encoder/decoder objects which hold state

 I personally don't think 1 is a good option since it's basically the
 same as 3 but just with libraries doing some of the work. We might as
 well do that work so that libraries aren't needed.

 This leaves us with 2 or 3. So the question is if we should support
 streaming or not. I suspect doing so would be worth it.

I think it should provide a non-streaming API.
And if there are concrete use cases, provide a streaming API as a separate one.

-- 
NARUSE, Yui  nar...@airemix.jp


Re: [whatwg] Encodings and the web

2012-01-08 Thread NARUSE, Yui
(2012/01/08 23:32), Anne van Kesteren wrote:
 On Sun, 08 Jan 2012 01:37:14 +0100, NARUSE, Yui nar...@airemix.jp wrote:
 = Legacy multi-octet Chinese (traditional) encodings

 Mozilla supports another Big5 variant, Big5-UAO.
 http://bugs.ruby-lang.org/issues/1784
 
 As part of the big5 encoding, right? It sounds like it's a good idea to adopt 
 that. I don't think there's much concern about table size these days, though 
 obviously the less complexity the better.

CCing the original reporter.
Could you help us understand the current situation in Taiwan?

 == iso-2022-jp
 === The to Unicode algorithm
  Based on iso-2022-jp state
 = ASCII state
 == Based on octet:
 === Otherwise
 If the fatal flag is set, return failure.
 Otherwise, emit the fallback code point.

 Just FYI, IE and Opera show these bytes as Katakana.
 If octet is greater than 0xA0 and less than 0xE0, value is octet + 0xFEC0.

 Moreover IE shows any shift_jis characters here.
 It seems that IE uses the same converter for both iso-2022-jp and shift_jis.
 
 I have filed a bug on Opera to become more strict like Webkit/Gecko. If there 
 is some evidence that approach is wrong though, we can turn it around.

There is an old variant of ISO-2022-JP called JIS8.
JIS8 was in use before RFC 1468 was written, and it is still used in some areas,
for example bank-to-bank information exchange.
The 8 in JIS8 means that 8-bit bytes express Katakana, which is just what is described above.

So I can't state that it is a bug in Opera at this time.
It depends on how many sites use such 8-bit Katakana.

-- 
NARUSE, Yui  nar...@airemix.jp


Re: [whatwg] Encodings and the web

2012-01-08 Thread NARUSE, Yui
Hi,

Thank you for the quick reply.

(2012/01/09 0:38), Lin Jen-Shin (godfat) wrote:
 On Sun, Jan 8, 2012 at 11:20 PM, NARUSE, Yui nar...@airemix.jp wrote:
 (2012/01/08 23:32), Anne van Kesteren wrote:
 On Sun, 08 Jan 2012 01:37:14 +0100, NARUSE, Yui nar...@airemix.jp wrote:
 = Legacy multi-octet Chinese (traditional) encodings

 Mozilla supports another Big5 variant, Big5-UAO.
 http://bugs.ruby-lang.org/issues/1784

 As part of the big5 encoding, right? It sounds like it's a good idea to 
 adopt that. I don't think there's much concern about table size these days, 
 though obviously the less complexity the better.

 CC to the original reporter.
 Could you cooperate about current situation in Taiwan?
 
 I am not sure what I can do here, but I would try my best to
 coordinate if there's anything I could help.
 
 So what are we trying to solve here, again?

This thread started from
http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2011-December/034241.html

and it is discussing a spec about encodings on the web.
http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html

I'm interested in whether web browsers other than Mozilla should implement
Big5-UAO or not.

Thanks,

-- 
NARUSE, Yui  nar...@airemix.jp


Re: [whatwg] iso-2022-jp and octets over 0x7E

2012-01-08 Thread NARUSE, Yui
(2012/01/09 4:49), Anne van Kesteren wrote:
 On Sun, 08 Jan 2012 15:32:47 +0100, Anne van Kesteren ann...@opera.com 
 wrote:
 On Sun, 08 Jan 2012 01:37:14 +0100, NARUSE, Yui nar...@airemix.jp wrote:
 == iso-2022-jp
 === The to Unicode algorithm
  Based on iso-2022-jp state
 = ASCII state
 == Based on octet:
 === Otherwise
 If the fatal flag is set, return failure.
 Otherwise, emit the fallback code point.

 Just FYI, IE and Opera show these bytes as Katakana.
 If octet is greater than 0xA0 and less than 0xE0, value is octet + 0xFEC0.

 Moreover IE shows any shift_jis characters here.
 It seems that IE uses the same converter for both iso-2022-jp and shift_jis.

 I have filed a bug on Opera to become more strict like Webkit/Gecko. If 
 there is some evidence that approach is wrong though, we can turn it around.
 
 So just to be sure I checked again and in Opera you can only get the 
 special single-octet behavior if you activate a particular state first. If 
 you are in ASCII, Opera will simply emit the octet unless it is 0x1B (ESC) so 
 maybe there is a system font that does something special for those 
 characters? Or maybe you meant something else?

Ah, you are correct.
Opera's behavior is different from IE's, and it is clearly wrong.

-- 
NARUSE, Yui  nar...@airemix.jp


Re: [whatwg] Encodings and the web

2012-01-07 Thread NARUSE, Yui
(2012/01/07 0:38), Anne van Kesteren wrote:
 On Thu, 22 Dec 2011 15:33:35 +0100, L. David Baron dba...@dbaron.org wrote:
 This seems like one of those areas where it may be substantially
 easier to figure out what implementations do by looking at their
 code than by reverse-engineering, at least for the implementations
 whose code is available publicly.

 Gecko's code lives in
 http://mxr.mozilla.org/mozilla-central/source/intl/uconv/ .  There
 are others who know it substantially better, but I or others could
 probably answer questions you have about how it works and how to
 understand it.

 I'm not the right person for pointers to other implementations,
 though.
 
 Thanks, I'm doing a combination of code inspection, reverse engineering 
 (especially for edge cases), and applying some lessons we learned (e.g. 
 non-greedy error handling).
 
 So far I defined the to Unicode algorithms for hz-gb-2312, euc-jp, 
 iso-2022-jp, and shift_jis.

= Legacy multi-octet Chinese (traditional) encodings

Mozilla supports another Big5 variant, Big5-UAO.
http://bugs.ruby-lang.org/issues/1784

= Legacy multi-octet Japanese encodings

 The jis code point for a given number is: ...
 The jis0208 index for a given octet is:

I wonder about this description.
I should explain the concept of JIS X 0208.

The most important thing is that JIS X 0208 is defined in the context of ISO 2022.
Its target is an ISO/IEC 2022 double-byte 94-character set.
This means its code space is 94 x 94.
http://en.wikipedia.org/wiki/JIS_X_0208

At the top, there are kuten numbers.
Ku is the row, expressed by the first byte of the double-byte code.
Ten is the cell, expressed by the second byte of the double-byte code.
So a kuten number expresses a code point.
Both ku and ten are integers from 1 to 94.
For example, the kuten number of Hiragana letter A (あ, U+3042) is 04-02.

ISO-2022-JP, EUC-JP, and Shift_JIS map a kuten number to bytes.
ISO-2022-JP's double bytes are:
 first:  ku  + 0x20
 second: ten + 0x20
EUC-JP's double bytes are:
 first:  ku  + 0xA0
 second: ten + 0xA0
Shift_JIS's double bytes are (using integer division):
 first:  if   1 <= ku <= 62 then (ku-1) / 2 + 0x81
         elif 63 <= ku <= 94 then (ku-1) / 2 + 0xC1
 second: if ku is odd
           if   1 <= ten <= 63 then ten + 0x3F
           elif 64 <= ten <= 94 then ten + 0x40
         elif ku is even then ten + 0x9E
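
The three mappings can be sketched as a single function. The function name is hypothetical. The Shift_JIS trailing-byte rule used here (odd rows take the 0x3F/0x40 offsets, even rows take 0x9E) is the one that agrees with what Python's shift_jis codec produces, which is used below as a cross-check.

```python
def kuten_to_bytes(ku, ten):
    """Return (ISO-2022-JP, EUC-JP, Shift_JIS) byte pairs for a kuten number."""
    if not (1 <= ku <= 94 and 1 <= ten <= 94):
        raise ValueError("ku and ten must both be in 1..94")

    iso2022 = bytes([ku + 0x20, ten + 0x20])
    euc = bytes([ku + 0xA0, ten + 0xA0])

    # Shift_JIS packs two rows into one lead byte.
    lead = (ku - 1) // 2 + (0x81 if ku <= 62 else 0xC1)
    if ku % 2 == 1:
        trail = ten + (0x3F if ten <= 63 else 0x40)  # skips 0x7F
    else:
        trail = ten + 0x9E
    sjis = bytes([lead, trail])

    return iso2022, euc, sjis

# Kuten 04-02 is U+3042; compare against Python's codecs.
iso2022, euc, sjis = kuten_to_bytes(4, 2)
print(iso2022.hex(), euc.hex(), sjis.hex())
print(euc == "\u3042".encode("euc_jp"), sjis == "\u3042".encode("shift_jis"))
```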


So theoretically, we should make a conversion table between
kuten numbers and Unicode scalar values.

But as you know, JIS X 0208 in the web context should be Windows Code Page 932,
as extended by Microsoft.
http://msdn.microsoft.com/en-us/goglobal/cc305152
It is defined in terms of Shift_JIS.

 The jis0212 index for a given octet is:

As written in Bugzilla@Mozilla Bug 600715, IE doesn't support JIS X 0212.
https://bugzilla.mozilla.org/show_bug.cgi?id=600715
How to treat JIS X 0212 in this Encoding spec will be a problem.

== iso-2022-jp
=== The to Unicode algorithm
 Based on iso-2022-jp state
= ASCII state
== Based on octet:
=== Otherwise
 If the fatal flag is set, return failure.
 Otherwise, emit the fallback code point.

Just FYI, IE and Opera show these bytes as Katakana.
If the octet is greater than 0xA0 and less than 0xE0, the value is octet + 0xFEC0.

Moreover, IE shows any shift_jis characters here.
It seems that IE uses the same converter for both iso-2022-jp and shift_jis.
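
The octet + 0xFEC0 rule above can be sketched as follows. The function name is hypothetical, and Python's shift_jis codec is used only as a cross-check, on the assumption that it maps the same single-octet range to halfwidth Katakana.

```python
def katakana_from_octet(octet):
    """Map an 8-bit Katakana octet to its code point, or None if out of range."""
    if 0xA0 < octet < 0xE0:
        return chr(octet + 0xFEC0)
    return None

for octet in (0xA1, 0xB1, 0xDF):
    ch = katakana_from_octet(octet)
    same = ch == bytes([octet]).decode("shift_jis")
    print(f"0x{octet:02X} -> U+{ord(ch):04X} (matches shift_jis: {same})")
```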

-- 
NARUSE, Yui  nar...@airemix.jp


Re: [whatwg] Default encoding to UTF-8?

2011-12-06 Thread NARUSE, Yui
 windows-1252 encoded and the override helps everyone. But 
 it may also be the case that the data is in a different encoding and that the 
 override therefore results in gibberish shown to the user, with no hint of 
 the cause of the problem.

I think such a case doesn't exist.
With character encoding overrides, a superset overrides a standard set.
So I can't imagine such a case.

 It would therefore be better to signal a problem to the user, display the 
 page using the windows-1252 encoding but with some instruction or hint on 
 changing the encoding. And a browser should in this process really analyze 
 whether the data can be windows-1252 encoded data that contains only 
 characters permitted in HTML.

Such verification should be done by developer tools, not by production browsers,
which are widely used by real users.

-- 
NARUSE, Yui  nar...@airemix.jp


Re: [whatwg] Question about the application/x-www-form-urlencoded encoding algorithm

2010-03-21 Thread NARUSE, Yui

Hi,

(2010/01/21 16:29), NARUSE, Yui wrote:

In 4.10.19.4 URL-encoded form data, The
application/x-www-form-urlencoded encoding algorithm,
it says:


For each character in the entry's name and value, apply the following 
subsubsteps:

If the character isn't in the range U+0020, U+002A, U+002D, U+002E,
U+0030 to U+0039, U+0041 to U+005A, U+005F, U+0061 to U+007A
then replace the character with a string formed as follows:
Start with the empty string, and then, taking each byte of the character
when expressed in the selected character encoding in turn,
append to the string a U+0025 PERCENT SIGN character (%) followed
by two characters in the ranges U+0030 DIGIT ZERO (0) to
U+0039 DIGIT NINE (9) and U+0041 LATIN CAPITAL LETTER A
to U+0046 LATIN CAPITAL LETTER F representing the hexadecimal value
of the byte (zero-padded if necessary).

If the character is a U+0020 SPACE character, replace it with a single U+002B 
PLUS SIGN character (+).


Does this mean that U+9670, encoded as \x89\x41 in Shift_JIS, must be
encoded as %89%41,
and not as %89A?


The spec reads as saying that
\x89\x41 in Shift_JIS should be encoded as %89%41.
But current implementations encode it as %89A.
(I tested IE, Firefox, Opera, and Chrome.)

So this should be a bug in the spec.
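
The two outputs can be reproduced with a short sketch. The first expression follows the literal reading of the quoted algorithm (every byte of an escaped character becomes %XX). The second uses Python's urllib.parse.quote_from_bytes purely as a stand-in for a byte-by-byte encoder that leaves ASCII letters alone; it is an assumption that this mirrors what the tested browsers do internally.

```python
from urllib.parse import quote_from_bytes

raw = "\u9670".encode("shift_jis")   # the two bytes 0x89 0x41

# Literal reading of the spec text: percent-encode every byte of the character.
per_spec = "".join(f"%{byte:02X}" for byte in raw)

# Byte-by-byte encoding: 0x41 is ASCII "A" and is left unescaped.
per_byte = quote_from_bytes(raw, safe="")

print(raw.hex(), per_spec, per_byte)
```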

--
NARUSE, Yui  nar...@airemix.jp


Re: [whatwg] [hybi] US-ASCII vs. ASCII in Web Socket Protocol

2010-01-31 Thread NARUSE, Yui
(2010/01/31 2:05), Julian Reschke wrote:
 Ian Hickson wrote:
 On Fri, 4 Dec 2009, WeBMartians wrote:
 Hmmm... Maybe it would be better to say ISO-646US rather than ASCII.
 There is a lot of impreciseness about the very low value characters
 (less than 0x20 space) in the ASCII specifications. The same can be
 said about the higher end.

 Where the interpretation was normative, I've used the term
 ANSI_X3.4-1968 (US-ASCII) and referenced RFC1345.
 
 I think you just lost both readability and precision.
 
 Please keep saying ASCII or US-ASCII, and then have a reference to
 the ANSI or ISO spec that actually defines ASCII, such as
 
[ANSI.X3-4.1986]  American National Standards Institute, Coded
  Character Set - 7-bit American Standard Code for
  Information Interchange, ANSI X3.4, 1986.
 
 (taken from the relatively recent RFC 5322).
 
 RFC 1345 is a non-maintained, historic informational RFC that's not
 really a good definition for ASCII. If you disagree, please name a
 single RFC that has been published in the last 20 years that uses RFC
 1345 to reference ASCII (I just searched, and couldn't find any).

The use of US-ASCII and ASCII in draft-hixie-thewebsocketprotocol-54 is correct.
Changing all of them to ASCII or to ANSI_X3.4-1968 is not correct.

In draft-hixie-thewebsocketprotocol-54, every occurrence of the term US-ASCII
has the form "encoded as US-ASCII". This is a use as an encoding name.
So the preferred MIME name, US-ASCII, is correct.

ASCII is used in the following ways:
* ASCII case-insensitive
* ASCII lowercase
* ASCII serialization.
* ASCII followed by a character, like ASCII :, ASCII CR, or ASCII space
* If /code/, interpreted as ASCII, is 407
* upper-case ASCII letters
* Unicode to ASCII
* the IDNA ToASCII algorithm
* UseSTD3ASCIIRules flags
They look like references to so-called ASCII, not to definitions in the ASCII specification.
So the nickname ASCII is suitable for them.


Anyway,
the latest definition of so-called ASCII is named ANSI INCITS 4-1986 (R2007).
http://webstore.ansi.org/RecordDetail.aspx?sku=ANSI+INCITS+4-1986+(R2007)

And its ISO version is ISO/IEC 646:1991 IRV.
http://www.iso.org/iso/catalogue_detail.htm?csnumber=4777

-- 
NARUSE, Yui  nar...@airemix.jp


[whatwg] Question about the application/x-www-form-urlencoded encoding algorithm

2010-01-20 Thread NARUSE, Yui
In 4.10.19.4 URL-encoded form data, The
application/x-www-form-urlencoded encoding algorithm,
it says:

 For each character in the entry's name and value, apply the following 
 subsubsteps:

 If the character isn't in the range U+0020, U+002A, U+002D, U+002E,
 U+0030 to U+0039, U+0041 to U+005A, U+005F, U+0061 to U+007A
 then replace the character with a string formed as follows:
 Start with the empty string, and then, taking each byte of the character
 when expressed in the selected character encoding in turn,
 append to the string a U+0025 PERCENT SIGN character (%) followed
 by two characters in the ranges U+0030 DIGIT ZERO (0) to
 U+0039 DIGIT NINE (9) and U+0041 LATIN CAPITAL LETTER A
 to U+0046 LATIN CAPITAL LETTER F representing the hexadecimal value
 of the byte (zero-padded if necessary).

 If the character is a U+0020 SPACE character, replace it with a single U+002B 
 PLUS SIGN character (+).

Does this mean that U+9670, encoded as \x89\x41 in Shift_JIS, must be
encoded as %89%41,
and not as %89A?

Thanks,

-- 
NARUSE, Yui
nar...@airemix.jp


Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]

2009-10-23 Thread NARUSE, Yui


Ian Hickson wrote:
 Authors should not use JIS-X-0208 (JIS_C6226-1983), JIS-X-0212
 (JIS_X0212-1990), encodings based on ISO-2022, and encodings based on
 EBCDIC.
 It is not clear what this means (e.g., the character set JIS_C6226-1983 in
 any encoding, or only when encoded alone according to RFC1345 as described
 above); 
 
 This is talking about character encodings, not character sets. 
 JIS_C6226-1983 is a registered character encoding in the IANA registry.

Yes, I can understand this, but...

 On Fri, 23 Oct 2009, NARUSE, Yui wrote:
 Authors should not use JIS-X-0208 (JIS_C6226-1983), JIS-X-0212 
 (JIS_X0212-1990), encodings based on ISO-2022, and encodings based 
 on EBCDIC.
 First, JIS-X-0208 and JIS-X-0212 are not in IANA Charsets, moreover 
 those correct names as spec are JIS X 0208 and JIS X 0212.
 
 On Thu, 22 Oct 2009, Øistein E. Andersen wrote:
 I am not sure what you mean; they are both listed at
 http://www.iana.org/assignments/character-sets:

 Name: JIS_C6226-1983 [RFC1345,KXS2]
 MIBenum: 63
 Source: ECMA registry
 Alias: iso-ir-87
 Alias: x0208
 Alias: JIS_X0208-1983
 Alias: csISO87JISX0208

 Name: JIS_X0212-1990 [RFC1345,KXS2]
 MIBenum: 98
 Source: ECMA registry
 Alias: x0212
 Alias: iso-ir-159
 Alias: csISO159JISX02121990
 
 On Fri, 23 Oct 2009, NARUSE, Yui wrote:
 Where is the word JIS-X-0208 ?
 Where is the word JIS-X-0212 ?
 
 The exact string isn't there, that's why I included the preferred MIME 
 names in brackets in the spec.

If it is talking about character encodings,
why does it mainly use the names of character sets?
The following seems better.

 Authors should not use JIS_C6226-1983, JIS_X0212-1990,
 encodings based on ISO-2022, and encodings based on EBCDIC.

 On Fri, 23 Oct 2009, NARUSE, Yui wrote:
 Second, JIS_C6226-1983, JIS_X0212-1990, and EBCDICs are not
 ASCII compatible. So they are out of discouraged; mustn't use.
 
 You can use non-ASCII-compatible encodings (e.g. UTF-16).

I see.

-- 
NARUSE, Yui  nar...@airemix.jp


Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]

2009-10-22 Thread NARUSE, Yui
Øistein E. Andersen wrote:
 Discouraged encodings:
 ‘4.2.5.5 Specifying the document's character encoding’ advises against
 certain encodings.  (Incidentally, this advice probably deserves not
 to be ‘hidden’ in a section nominally reserved for character encoding
 *declaration* issues.)  In particular:

 Authors should not use JIS-X-0208 (JIS_C6226-1983), JIS-X-0212
 (JIS_X0212-1990), encodings based on ISO-2022, and encodings based on
 EBCDIC.

First, JIS-X-0208 and JIS-X-0212 are not in the IANA charset registry;
moreover, their correct names per the specs are JIS X 0208 and JIS X 0212.

Second, JIS_C6226-1983, JIS_X0212-1990, and the EBCDIC encodings are not
ASCII-compatible. So they are beyond merely discouraged; they must not be used.

Finally, why the ISO 2022 series is discouraged is not clear.


Anyway, most of the charsets defined in RFC 1345 are not clearly specified.
Conversion tables between them and Unicode are needed.

-- 
NARUSE, Yui  nar...@airemix.jp


Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]

2009-10-22 Thread NARUSE, Yui


Øistein E. Andersen wrote:
 On 22 Oct 2009, at 17:15, NARUSE, Yui wrote:
 
 First, JIS-X-0208 and JIS-X-0212 are not in IANA Charsets,
 
 I am not sure what you mean; they are both listed at
 http://www.iana.org/assignments/character-sets:
 
 Name: JIS_C6226-1983 [RFC1345,KXS2]
 MIBenum: 63
 Source: ECMA registry
 Alias: iso-ir-87
 Alias: x0208
 Alias: JIS_X0208-1983
 Alias: csISO87JISX0208

Where is the string JIS-X-0208?

 Name: JIS_X0212-1990 [RFC1345,KXS2]
 MIBenum: 98
 Source: ECMA registry
 Alias: x0212
 Alias: iso-ir-159
 Alias: csISO159JISX02121990

Where is the string JIS-X-0212?

 moreover those correct names as spec are JIS X 0208 and JIS X 0212.
 
 Please
 excuse me for not always paying due attention to such details in
 e-mails. Of course, the specifications should follow either IANA or the
 official standard as appropriate, depending on what it is referring to.)

That was not aimed at you; this sentence is in the current HTML5 draft, section 4.2.5.5.
That is why I paid attention to it.

 Anyway, most of charsets defined RFC 1345 are not clear.
 Conversion table between [those charsets and] Unicode is needed.
 
 Quite.  Anne van Kesteren, I and several others are currently trying to
 document how browsers handle different encodings at
 http://wiki.whatwg.org/wiki/Web_Encodings, and defining mappings to
 Unicode is one of the goals.  Your contribution would be much appreciated.

ICU has a large set of tables which are likely to cover many MS code pages.
(Of course, this should be verified.)
http://bugs.icu-project.org/trac/browser/data/trunk/charset/data/ucm

And I have a CP51932 table made from the .NET Framework's converter.
http://nkf.sourceforge.jp/ucm/cp51932.ucm

-- 
NARUSE, Yui  nar...@airemix.jp


Re: [whatwg] Web Address and its escape

2009-09-09 Thread NARUSE, Yui
Anne van Kesteren wrote:
 On Tue, 08 Sep 2009 21:40:22 +0200, NARUSE, Yui nar...@airemix.jp wrote:
 First is about 4.10.16.4 URL-encoded form data.
 http://www.whatwg.org/specs/web-apps/current-work/#application/x-www-form-urlencoded-encoding-algorithm


 In this algorithm at 6.2.1,
 SP, *, -, ., 0 .. 9, A .. Z, _, a .. z is not escaped.
 But many other specs which use application/x-www-form-urlencoded refers
 
 Which other specifications?

The following specifications. (Sorry, some of them refer to earlier RFCs.)

XForms 1.0
  http://www.w3.org/TR/xforms/#serialize-urlencode
  then non-ASCII and reserved characters (as defined by [RFC 2396] as
  amended by subsequent documents in the IETF track) are escaped
  - so RFC3986

HTML 4
  http://www.w3.org/TR/html401/interact/forms.html#h-17.13.4.1
  reserved characters are escaped as described in [RFC1738]
  RFC1738 http://www.faqs.org/rfcs/rfc1738.html
unreserved = alpha | digit | safe | extra
safe   = $ | - | _ | . | +
extra  = ! | * | ' | ( | ) | ,

TAG Finding
  refer to section 2.1 of [RFC2396].
  http://www.w3.org/2001/tag/doc/whenToUseGet.html#i18n
  RFC2396 http://www.faqs.org/rfcs/rfc2396.html
  unreserved  = alphanum | mark
  mark= - | _ | . | ! | ~ | * | ' | ( | )

WSDL 2.0
  http://www.w3.org/TR/wsdl20-bindings/#_http_x-www-form-urlencoded
  Replacement values falling outside the range (ALPHA and DIGIT below are
  defined as per [IETF RFC 4234]): ALPHA | DIGIT | - | . | _ | ~ | ! |
  $ | & | ' | ( | ) | * | + | , | ; | = | : | @,
  MUST be percent-encoded.

 URI's unreserved. And it in RFC3986 is
unreserved= ALPHA / DIGIT / - / . / _ / ~
 Why ~ is escaped and * is not escaped?
 
 What do browsers do?

IE8
QUERY_STRING: 
t=+%21%5c%22%5c%23%24%25%26%27%28%29*%2b%2c-.%2f0123456789%3a%3b%3c%3d%3e...@abcdefghijklmnopqrstuvwxyz%5b%5c%5c%5d%5e_%60abcdefghijklmnopqrstuvwxyz%7b%7c%7d%7e
not escaped: *...@_

Firefox 3.5
QUERY_STRING: 
t=+%21%5C%22%5C%23%24%25%26%27%28%29*%2B%2C-.%2F0123456789%3A%3B%3C%3D%3E%3F%40ABCDEFGHIJKLMNOPQRSTUVWXYZ%5B%5C%5C%5D%5E_%60abcdefghijklmnopqrstuvwxyz%7B%7C%7D%7E
not escaped: *-._

Chrome2
QUERY_STRING: 
t=+%21%5C%22%5C%23%24%25%26%27%28%29*%2B%2C-.%2F0123456789%3A%3B%3C%3D%3E%3F%40ABCDEFGHIJKLMNOPQRSTUVWXYZ%5B%5C%5C%5D%5E_%60abcdefghijklmnopqrstuvwxyz%7B%7C%7D%7E
not escaped: *-._

Opera9
QUERY_STRING: 
t=+%21%5C%22%5C%23%24%25%26%27%28%29%2A%2B%2C-.%2F0123456789%3A%3B%3C%3D%3E%3F%40ABCDEFGHIJKLMNOPQRSTUVWXYZ%5B%5C%5C%5D%5E_%60abcdefghijklmnopqrstuvwxyz%7B%7C%7D%7E
not escaped: -._

Hmm, Firefox and Chrome follow this; IE additionally leaves @ unescaped; Opera
additionally escapes *.
If this spec wants to be on the safer side, * could also be escaped.
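
The difference between the two character sets discussed in this message can be made explicit with a small sketch. The set names are made up for illustration. The HTML5 set is taken from the algorithm as summarized in this thread (SP is left out because it is turned into + rather than kept as-is), and the RFC 3986 set is its unreserved production.

```python
import string

ALNUM = set(string.ascii_letters + string.digits)

# Characters left unescaped by the form-encoding algorithm as quoted here.
HTML5_FORM_UNESCAPED = ALNUM | set("*-._")
# unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
RFC3986_UNRESERVED = ALNUM | set("-._~")

only_in_html5 = sorted(HTML5_FORM_UNESCAPED - RFC3986_UNRESERVED)
only_in_rfc3986 = sorted(RFC3986_UNRESERVED - HTML5_FORM_UNESCAPED)

print("unescaped only by the form algorithm:", only_in_html5)
print("unreserved only in RFC 3986:", only_in_rfc3986)
```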

 Third is about Web addresses in HTML 5. (this spec is also this ML?)
 http://www.w3.org/html/wg/href/draft
 
 You want public-...@w3.org or public-h...@w3.org for that draft.

Thanks, I'll send it.

-- 
NARUSE, Yui  nar...@airemix.jp


[whatwg] Web Address and its escape

2009-09-08 Thread NARUSE, Yui
Hi,
I have some comments and questions about urlencode and Web Address.


First is about 4.10.16.4 URL-encoded form data.
http://www.whatwg.org/specs/web-apps/current-work/#application/x-www-form-urlencoded-encoding-algorithm

In this algorithm, at step 6.2.1,
SP, *, -, ., 0 .. 9, A .. Z, _, a .. z are not escaped.
But many other specs that use application/x-www-form-urlencoded refer to
the URI unreserved set, which in RFC3986 is
   unreserved= ALPHA / DIGIT / - / . / _ / ~
Why is ~ escaped while * is not?


Second is also about URL-encoded form data, step 6.2.1.
It says:
 the string a U+0025 PERCENT SIGN character (%) followed by two
 characters in the ranges U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE
 (9) and U+0041 LATIN CAPITAL LETTER A to U+005A LATIN CAPITAL LETTER Z
But hexadecimal digits are 0-9 and A-F,
so "to U+0046 LATIN CAPITAL LETTER F" seems right.


Third is about Web addresses in HTML 5. (Is this spec also in scope for this ML?)
http://www.w3.org/html/wg/href/draft

In 2 Parsing Web addresses, step 2, Percent-encode all non-URI characters in w,
percent-encodes many characters, including U+0025 PERCENT SIGN.
But under this spec, if a Web address w is an already-escaped URL,
this process double-escapes those characters.

For example, if w is http://www.example.org/D%C3%BCrst,
then after step 2, w becomes http://www.example.org/D%25C3%25BCrst.
And at step 5, w is broken.
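
The double-escaping effect can be sketched with Python's urllib.parse.quote, which also percent-encodes the percent sign. It is used here only as a stand-in for step 2 of the draft, not as an implementation of that draft.

```python
from urllib.parse import quote, unquote

w = "http://www.example.org/D%C3%BCrst"

# Escaping an already-escaped address turns each % into %25.
escaped_again = quote(w, safe=":/")
print(escaped_again)

# One round of unescaping now only gets back to the escaped form,
# not to the original character.
print(unquote(escaped_again) == w)
print(unquote(w))
```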

Regards.

-- 
NARUSE, Yui  nar...@airemix.jp