Peter Watkins wrote:
> 7.3.3 in draft 11 says
>
> The "openid2.provider" and "openid2.local_id" URLs MUST NOT include entities
> other than "&amp;", "&lt;", "&gt;", and "&quot;". Other characters that would
> not be valid in the HTML document or that cannot be represented in the
> document's character encoding MUST be escaped using the percent-encoding
> (%xx) mechanism described in [RFC3986] (Berners-Lee, T., "Uniform Resource
> Identifiers (URI): Generic Syntax").

Please note that the draft is completely broken here:

It's unclear: The first sentence talks about "entities", which can only
refer to "character entity references" (HTML 4.01, 5.3.2). The second
sentence mandates RFC 3986 encoding, which is plain wrong because it
changes the URI. It does not talk about "numeric character references" at
all (which are _not_ entities, see HTML 4.01, 5.3.1), although they are the
only correct way to encode a URI that contains a "'" (written as "&#39;" or
"&#x27;").

It's incompatible: An HTML editor, tool or filter may assume that changing
any characters to entities is allowed, so it may change
"http://[EMAIL PROTECTED]" to "http://example.org?login=user@example.net"
without changing the meaning. The spec breaks this assumption.

It's dangerous: It's there to allow RP implementations to use a quick and
dirty regexp-based parser instead of a true HTML parser, which (a) may
break with completely valid HTML documents (bad user experience) and (b)
may circumvent security measures taken by the site owners.

> 1) Why are the characters &, <, >, and " allowed to be represented with those
> SGML entities? Why not require them to be encoded per RFC 3986 as %26, %3C,
> %3E, and %22?

The point of RFC 3986 encoding is that URL special chars lose their special
meaning _within_ _the_ _URL_:

http://example.org/?foo=1&bar=2 contains two parameters: "foo" with the
value "1" and "bar" with the value "2".

http://example.org/?foo=1%26bar=2 contains a _single_ parameter, "foo",
with the value "1&bar=2".

The point of HTML encoding is that HTML special chars lose their special
meaning _within_ _HTML_:

<a href="http://example.org/?x=1&copy=2"> is a link to the IRI
http://example.org/?x=1©=2, which is equivalent to the ASCII URI
http://example.org/?x=1%C2%A9=2.

<a href="http://example.org/?x=1&amp;copy=2"> is a link to the URI
http://example.org/?x=1&copy=2.

However, "<" and ">" are not legal within URIs and IRIs anyway.
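
The two layers can be tried out with a minimal Python sketch. It assumes only the standard library's urllib.parse and html modules; the helper name query_params is made up for illustration. Note that html.unescape applies the general HTML5 character-reference rules, so a real HTML parser may treat a semicolon-less reference inside an attribute value differently.

```python
from html import unescape
from urllib.parse import parse_qs, urlsplit


def query_params(url):
    """Hypothetical helper: decode a URL's query string into {name: [values]}."""
    return parse_qs(urlsplit(url).query)


# Layer 1: percent-encoding acts inside the URL.
# A literal "&" separates parameters; "%26" is plain data inside one value.
two_params = query_params("http://example.org/?foo=1&bar=2")
# two_params == {"foo": ["1"], "bar": ["2"]}
one_param = query_params("http://example.org/?foo=1%26bar=2")
# one_param == {"foo": ["1&bar=2"]}

# Layer 2: character references act inside the HTML document.
# They are resolved before the attribute value is interpreted as a URL.
bare = unescape("http://example.org/?x=1&copy=2")
# bare ends in "?x=1\u00a9=2": "&copy" was read as the copyright sign
escaped = unescape("http://example.org/?x=1&amp;copy=2")
# escaped == "http://example.org/?x=1&copy=2"
```

Replacing "&amp;" with "%26" in the second href would therefore not be a neutral re-encoding: it would merge the two query parameters into one.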
Other characters with named entities are outside the ASCII range and thus
illegal in URIs but not IRIs.

> 2) Also, should 7.3.3 specify that, as with the key/value data pairs, these
> values be encoded in UTF-8? Requiring UTF-8 would free RP code from having
> to understand different HTML character sets, and would allow users to encode
> their HTML delivery pages in the charset of their choosing.

No, the whole HTML document must use the same character set. However,
unless you're using IRIs, you can usually get away with treating the
document as ASCII; you'll have some characters with the 8th bit set, but
you can simply ignore them if you just want to extract URIs.

Problematic charsets include ISO-2022 (common), Shift-JIS (very common;
only "~" is a problem with respect to URIs, since it cannot be encoded at
all), UTF-16 (rare), UTF-32 (very rare), EBCDIC-based charsets (very rare)
and national ISO-646 variants.

Claus
_______________________________________________
specs mailing list
specs@openid.net
http://openid.net/mailman/listinfo/specs