Re: HTML discovery: SGML entities and charsets
Peter Watkins schrieb:
> 7.3.3 in draft 11 says: "The openid2.provider and openid2.local_id URLs MUST NOT include entities other than "&amp;", "&lt;", "&gt;", and "&quot;". Other characters that would not be valid in the HTML document or that cannot be represented in the document's character encoding MUST be escaped using the percent-encoding (%xx) mechanism described in [RFC3986] (Berners-Lee, T., "Uniform Resource Identifiers (URI): Generic Syntax")."

Please note that the draft is completely broken here:

It's unclear: The first sentence talks about entities, which can only refer to character entity references (HTML 4.01, 5.3.2). The second sentence mandates RFC 3986 encoding, which is plain wrong because it changes the URI. It does not talk about numeric character references at all (which are _not_ entities, see HTML 4.01, 5.3.1), even though they are the only correct way to encode a URI that contains a "'" ("&#39;"/"&#x27;"), since HTML 4.01 defines no named entity for that character.

It's incompatible: An HTML editor, tool or filter may assume that changing any character to an entity reference is allowed, so it may change http://[EMAIL PROTECTED] to http://example.org?login=user&#64;example.net without changing the meaning. The spec breaks this assumption.

It's dangerous: It's there to allow RP implementations to use a quick-and-dirty regexp-based parser instead of a true HTML parser, which (a) may break with completely valid HTML documents (bad user experience) and (b) may circumvent security measures taken by the site owners.

> 1) Why are the characters &, <, >, and " allowed to be represented with those SGML entities? Why not require them to be encoded per RFC 3986 as %26, %3C, %3E, and %22?

The point of RFC 3986 encoding is that URL special characters lose their special meaning _within_ _the_ _URL_: http://example.org/?foo=1&bar=2 contains two parameters, foo with the value 1 and bar with the value 2. http://example.org/?foo=1%26bar=2 contains a _single_ parameter, foo, with the value 1&bar=2.

The point of HTML encoding is that HTML special characters lose their special meaning _within_ _HTML_: <a href="http://example.org/?x=1&copy=2"> is a link to the IRI http://example.org/?x=1©=2, which is equivalent to the ASCII URI http://example.org/?x=1%C2%A9%3D2. <a href="http://example.org/?x=1&amp;copy=2"> is a link to the URI http://example.org/?x=1&copy=2. However, characters like '<' and '>' are not legal within URIs and IRIs anyway. Other characters with named entities are outside the ASCII range and thus illegal in URIs but not IRIs.

> 2) Also, should 7.3.3 specify that, as with the key/value data pairs, these values be encoded in UTF-8? Requiring UTF-8 would free RP code from having to understand different HTML character sets, and would allow users to encode their HTML delivery pages in the charset of their choosing.

No, the whole HTML document must use the same character set. However, unless you're using IRIs, you can usually get away with treating the document as ASCII: you'll see some characters with the 8th bit set, but you can simply ignore them if you just want to extract URIs. Problematic charsets include ISO 2022 (common), Shift-JIS (very common; only '~' is a problem wrt URIs, as it cannot be encoded at all), UTF-16 (rare), UTF-32 (very rare), EBCDIC-based charsets (very rare) and national ISO 646 variants.

Claus
___
specs mailing list
specs@openid.net
http://openid.net/mailman/listinfo/specs
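To illustrate the HTML layer described above: before an RP can use an extracted href value as a URI, it must decode both named entities and numeric character references. A minimal sketch (the href values are made up; Python's stdlib html.unescape handles both kinds of reference):

```python
from html import unescape

# Made-up href values as they might appear inside an HTML attribute.
href_named = "http://example.org/?x=1&amp;copy=2"                # named entity
href_numeric = "http://example.org/?login=user&#64;example.net"  # numeric ref

# unescape() resolves &amp; as well as decimal (&#NN;) and
# hexadecimal (&#xNN;) numeric character references in one pass.
print(unescape(href_named))    # http://example.org/?x=1&copy=2
print(unescape(href_numeric))  # http://example.org/?login=user@example.net
```

Note that decoding must happen exactly once: the first URL above must not be unescaped a second time, or the now-literal "&copy" would be misread as another entity reference.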
Re: HTML discovery: SGML entities and charsets
Josh Hoyt schrieb:
> There has been a little discussion in the past about the restriction on allowed character entity references. I don't think there has been any about numeric character references, except in lumping them in with character entity references. These restrictions live on from the OpenID 1 specification, and were preserved primarily to ease backwards compatibility (IIRC).

It seems that it has been taken from the pingback specification:
http://www.hixie.ch/specs/pingback/pingback#TOC2.2

The rationale given there is that it should not be necessary to implement a full HTML parser. Unfortunately, this claim is completely bogus: as HTML has a context-sensitive grammar, you just can't parse it with regular expressions. If you try, you will inevitably write a parser that falls for some HTML constructs users might expect to work. (For example: comments. It's nearly unimaginable, but users might even try to put comment markers around an OpenID link, add another OpenID link, and expect RPs to use the one not within a comment.) Others will also try, and will inevitably write a parser that falls for some _different_ HTML constructs.

The result is that one RP might work with a URL (because it can handle comments within the HTML) and another one does not. Without looking at the code of the RP's HTML parser, it is nearly impossible for the user to tell why some RPs fail. If that isn't extremely bad user experience, what is? (As a side note: there's no telling whether there's a security risk with some RPs, either.)

The only way around this is using a real HTML parser. If you do, there's no reason not to parse and handle all character references.

Claus
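The comment pitfall is easy to demonstrate. A sketch (the document and URLs are invented): a naive regexp finds a link inside an HTML comment, while a real HTML parser correctly ignores it:

```python
import re
from html.parser import HTMLParser

doc = """<html><head>
<!-- <link rel="openid2.provider" href="http://old.example.com/"> -->
<link rel="openid2.provider" href="http://new.example.com/">
</head><body></body></html>"""

# A naive regexp-based "parser" finds BOTH links, including the one
# inside the comment -- it has no notion of HTML comment syntax.
naive = re.findall(r'<link[^>]+href="([^"]+)"', doc)

# A real HTML parser skips comment content, so only the live link is seen.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hrefs = []
    def handle_starttag(self, tag, attrs):
        if tag == "link":
            d = dict(attrs)
            if d.get("rel") == "openid2.provider":
                self.hrefs.append(d.get("href"))

p = LinkCollector()
p.feed(doc)
print(naive)    # both URLs, the commented-out one first
print(p.hrefs)  # only http://new.example.com/
```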
Re: Final outstanding issues with the OpenID 2.0 Authenticationspecification
Marius Scurtescu schrieb:
> The new attribute values are needed in order to signal an OpenID 2 provider.

Why is this necessary? Is OpenID 2 incompatible? In other words, what happens if an OpenID 2 Relying Party tries to talk to an OpenID 1.x Provider?

If the OpenID 1.x Provider just ignores additional message fields (i.e. treats them like an unknown extension), then no new rel values are needed. If this is not the case, maybe the OpenID 2 spec can be changed to make it possible.

It is always better to detect features, not versions.

Claus
Re-defining the Key-Value format (was: attribute exchange value encoding)
Johnny Bufu schrieb:
> So I've rewritten the encoding section, such that:
> - for strings, only the newline (and percent) characters are required to be escaped (to comply with OpenID's data formats), using percent-encoding;

This means that '%' characters need to be encoded up to three times. For example:

User name:
  100%pure
Embedded in a URI that is the value of the attribute:
  http://example.com/foo/100%25pure
Encoded for AX using Key-Value Form Encoding (OID 2, 4.1.1.):
  openid.ax.foo.uri:http://example.com/foo/100%2525pure
Encoded for AX using HTTP Encoding (OID 2, 4.1.2.):
  openid.ax.foo.uri=http%3A//example.com/foo/100%2525pure

I don't think it's a good idea to introduce a solution to the \n problem in AX only. It should be part of the base spec (OpenID 2 Authentication). What about changing section 4.1.1. from:

| A message in Key-Value form is a sequence of lines. Each line begins with a key, followed by a colon, and the value associated with the key. The line is terminated by a single newline (UCS codepoint 10, "\n"). A key or value MUST NOT contain a newline and a key also MUST NOT contain a colon.

to (wording adapted from RFC 2822):

| A message in Key-Value form consists of fields composed of a key, followed by a colon (":"), followed by a value, and terminated by a single LF (UCS codepoint 10, "\n"). The key MUST be composed of printable US-ASCII characters except ":" (i.e. characters that have values between 33 and 57, or between 59 and 126, inclusive). The key MUST NOT start with a '*' (codepoint 42). The value MUST be composed of a sequence of characters encoded as UTF-8.
|
| If an extension to this specification allows values that contain LF (UCS codepoint 10, "\n") characters, these LF characters MUST be encoded as the sequence LF, '*', ':' (UCS codepoints 10, 42, 58, "\n*:").

[Unlike the suggested %-encoding, this encoding is compatible with the current spec as long as LF characters are not actually allowed within the value. It's similar to the RFC 2822 folding mechanism, but folding is only allowed (and mandated) where a LF is to be encoded. Further, the continuation line is compatible with the key-value format, using '*' as a pseudo key.]

| If an extension to this specification needs to allow binary data in values, i.e. if it allows arbitrary bytes that are not to be interpreted as UTF-8 characters, it MAY use Base64 [reference] encoding for the specification of the format of that value.

[Note: Base64 is quite efficient when it comes to encoding the message in HTTP Encoding (OID 2, 4.1.2.). Unencoded bytes would have to use the %-encoding, roughly doubling the size. Unencoded bytes also create problems if implementations think they should be UTF-8, e.g. if perl strings are used.]

> - base64 must be used for encoding binary data, and defined an additional field for this: openid.ax.encoding.alias=base64

I think it's much simpler if the specification of the field value format just says UTF-8 or Base64 and if the same encoding is used for all actual values, even those that would not need any encoding.

Claus
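A quick sketch of the proposed folding (my own illustration of the wording above, not normative): an LF inside a value is serialized as LF '*' ':', and a parser treats a line with the pseudo key '*' as a continuation of the previous value:

```python
def encode_value(value: str) -> str:
    # The proposed folding: each LF in a value becomes LF '*' ':',
    # i.e. the continuation appears as a line with the pseudo key '*'.
    return value.replace("\n", "\n*:")

def serialize(pairs):
    return "".join(f"{k}:{encode_value(v)}\n" for k, v in pairs)

def parse(message: str):
    pairs = []
    for line in message.split("\n")[:-1]:   # trailing \n yields an empty last item
        key, _, value = line.partition(":")
        if key == "*" and pairs:            # continuation of the previous value
            k, prev = pairs[-1]
            pairs[-1] = (k, prev + "\n" + value)
        else:
            pairs.append((key, value))
    return pairs

pairs = [("mode", "id_res"), ("note", "line one\nline two")]
msg = serialize(pairs)
print(repr(msg))  # 'mode:id_res\nnote:line one\n*:line two\n'
assert parse(msg) == pairs
```

As claimed above, a value without LF serializes exactly as in the current spec, so the change is backwards compatible for all existing messages.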
Persistent Identifiers (was: Proposal for Recycling Identifiers in OpenID 2.0)
Dmitry Shechtman schrieb:
> This is definitely an interesting proposal. However, it only attempts to solve the recycling problem, whereas canonical IDs would solve this and several more. I think the best solution would be a Persistent Identifier. If the OpenID Provider returns a different Persistent Identifier, the Relying Party can assume the ID has been recycled.

There's no reason the Persistent Identifier should be part of the URI. It can just be returned by the OpenID Provider as a message parameter, which can be ignored by Relying Parties not interested. Or, even better, it could only be used in Attribute Exchange.

If Persistent Identifiers are unique across OpenID Providers, they can even be used to allow users to change their claimed identity. This can be achieved by using a public cryptographic key as the Persistent Identifier (well, semi-persistent, actually). A rough sketch:

Login:

On login, the RP tells the OP that it wants to use a Persistent Identifier and requests ALL keys for the user:

RP -> OP:
  openid.persid.version=0.1
  openid.persid.getkey=ALL

The OP returns a list of its public keys, both current keys and obsolete previous keys (which it can still authenticate as):

OP -> RP:
  openid.persid.key.0=DH:base 64...
  openid.persid.key.1=RSA:base 64...
  openid.persid.key.2.obsolete=RSA:base 64...
  openid.persid.key.3.obsolete=FOO:base 64...

The RP has stored key #2 as a persistent identifier for a user, so it asks the OP to authenticate with that key:

RP -> OP:
  openid.persid.challenge.0.key=RSA:base 64...
  openid.persid.challenge.0.id=120938231
  openid.persid.challenge.0.data=base 64...

The OP presents the correct answer but wants the RP to update to the new persistent identifier:

OP -> RP:
  openid.persid.response.id=120938232
  openid.persid.response.data=RSA:base 64...

The RP now accepts the user as being identical to the user which had key #2. It now stores key #1, which is the current RSA key, in its database for that user, possibly overwriting key #2. It might also update the OpenID identifier.

Login with known key:

Here, the RP already has a key for the claimed identifier, so it just sends it with the initial request:

RP -> OP:
  openid.persid.version=0.1
  openid.persid.getkey=CURRENT
  openid.persid.challenge.0.key=RSA:base 64...
  openid.persid.challenge.0.id=120938232
  openid.persid.challenge.0.data=base 64...

The OP can now immediately return the correct answer along with a list of its CURRENT keys:

OP -> RP:
  openid.persid.response.0.id=120938231
  openid.persid.response.0.data=base 64...
  openid.persid.key.0=RSA:base 64...
  openid.persid.key.1=RSA:base 64...

As above, the RP now accepts the user as being identical to the user which had key #2 and stores key #1, which is the current RSA key, in its database.

Claus
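For concreteness, the messages in the sketch above could be serialized like any other OpenID message. A toy illustration (the persid.* field names come from the sketch; the key material is a placeholder, not a real key):

```python
import base64

# Placeholder "key" material; real messages would carry actual public keys.
fake_key = base64.b64encode(b"not a real RSA key").decode()

def kv_serialize(fields):
    # OpenID Key-Value Form: one "key:value" line per field, LF-terminated.
    return "".join(f"{k}:{v}\n" for k, v in fields.items())

request = {
    "persid.version": "0.1",
    "persid.getkey": "ALL",
}
response = {
    "persid.key.0": "RSA:" + fake_key,
    "persid.key.1.obsolete": "RSA:" + fake_key,
}

print(kv_serialize(request), end="")
print(kv_serialize(response), end="")
```

(In indirect messages the fields would instead be form-encoded with the "openid." prefix, as in the sketch above.)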
Re: HTML discovery: SGML entities and charsets
Peter Watkins schrieb:
> I don't think it's reasonable to expect RP code to be capable of parsing every possible charset in which an HTML page might be encoded. I also don't think it's reasonable to specify specific charsets that RPs should be able to decode and then require OpenID users to use those charsets in their web pages just so RPs can parse these two link elements. I believe the contents of those two tags' HREF attributes should be defined as UTF-8 representations of the URLs, encoded per RFC 3986.

URIs are always confined to a small number of characters, roughly a subset of the ASCII repertoire. They consist of characters (which may be represented in ASCII, UTF-32, EBCDIC, ink on paper, etc.), not bytes (or coded characters). Non-ASCII characters (or special characters) are not a concern when embedding finished URIs in HTML documents; they are only a concern when making the URIs (see below). Percent-encoding is likewise only a concern when making the URIs.

Actually, just two characters may need to be encoded when a URI is embedded in an HTML document: '&' and "'" (the latter only if the attribute is, for some inexplicable reason, using "'" as quotes). Only '&' has a named entity: '&amp;'. All the other named entities defined in HTML are either for characters above U+007E or for specials not allowed within URIs. However, any other character *may* be encoded; for example, '@' might be encoded as '&#x40;' and 'A' might be encoded as '&#65;'.

Actually, handling different legacy charsets is very easy: if it's an extended ASCII charset, just don't try to interpret characters with a set 8th bit. That does not work with UTF-16, UTF-32, ISO 2022 and EBCDIC, however.

So the two questions to answer here are:

. What charsets does an RP need to be able to handle?
  - extended ASCII (including UTF-8, ISO 8859, GB 18030)
  - UTF-16 (including endian detection)
  - UTF-32?
  - ISO 2022 (a switching charset that might fool ASCII parsers; any sequence not in the ASCII plane can be ignored just like 8-bit chars with extended ASCII)?
  - EBCDIC?

. What character references does an RP need to handle?
  - entity references (i.e. '&amp;')
  - numeric character references ('&#xNN;' and '&#NNN;')

> A link in a big5 HTML document to an internationalized URL may not be decipherable by my web browser, and that's normally OK because an internationalized Chinese URL in a Chinese-language document is probably nothing I could read, anyway. HTML is designed for human communication.

Well, if we're talking about IRIs (Internationalised Resource Identifiers), that's a completely different story. Like URIs, they are made of characters; however, these characters may now be above U+007E. When embedding them in HTML, a lot of additional named entity references apply. Further, you can't get away with just handling extended ASCII as ASCII.

URIs can be mapped to IRIs by undoing the percent-encoding for bytes that form valid UTF-8 sequences and interpreting the result as UTF-8. For example, http://example.com/f%C3%A4rber can be mapped to http://example.com/färber. However, http://example.com/f%E4rber cannot be mapped to an IRI (i.e. the IRI is just identical to the URI).

Currently, the HTML 4.01 spec does not formally allow IRIs; the HTML 5 draft, however, does.

With all of this, the real question here is:

. Should support for IRIs be required?

If IRIs are allowed, the number of charsets and named entity references an RP must be able to handle is much larger. So if yes, the same questions as above come up again:

. What charsets does an RP need to be able to handle?
  - ISO 8859-X, Windows-1252?
  - UTF-8
  - GB 18030
  - EUC
  - UTF-16 (including endian detection)
  - UTF-32?
  - ...

. What character references does an RP need to handle?
  - entity references (full HTML list)
  - numeric character references ('&#xNN;' and '&#NNN;')

> Instead of thinking of the OpenID2 values as text, think of them as binary data that a machine needs to read. If an internationalized Chinese URL is converted to UTF-8 bytes and then URI-encoded, it is then reduced to lowest-common-denominator text: US-ASCII.

That's basically what URIs already do. No need to reinvent the wheel.

> Consider an identity URL like http://www.färber.de/claus. In UTF-8, ä is represented by bytes 0xC3 and 0xA4, so a RFC3986-encoded UTF-8 representation of http://www.färber.de/claus would be http://www.f%C3%A4rber.de/claus

Or just http://www.xn--frber-gra.de/claus, which also works with software that can't handle IDNs at all. It does not work like that with the path component of HTTP URIs, however: http://example.com/f%E4rber (using ISO 8859-1), http://example.com/f%7Brber (ISO 646 DE) and http://example.com/f%C3%A4rber (UTF-8) are all valid URIs. As a general rule, URIs contain bytes (possibly percent-encoded), not characters. The mapping between these bytes and characters can be made by the URI specification (e.g. domain names), by the server that hosts the resource (e.g. a Windows
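The IRI/URI mapping discussed above can be sketched in Python (my illustration; the hostnames are the examples from this thread). Note how the %E4 case cannot be turned into an IRI:

```python
from urllib.parse import quote, unquote_to_bytes

# IRI -> URI: percent-encode non-ASCII characters as UTF-8 bytes,
# leaving the URI's reserved characters (and existing escapes) alone.
iri = "http://example.com/färber"
uri = quote(iri, safe=":/?#[]@!$&'()*+,;=%")
print(uri)  # http://example.com/f%C3%A4rber

# URI -> IRI: only possible if the percent-encoded bytes are valid UTF-8.
def to_iri(uri: str) -> str:
    raw = unquote_to_bytes(uri)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # e.g. %E4 (an ISO 8859-1 byte): the IRI is just the URI itself.
        return uri

print(to_iri("http://example.com/f%C3%A4rber"))  # http://example.com/färber
print(to_iri("http://example.com/f%E4rber"))     # unchanged
```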
Re: Re-defining the Key-Value format
Johnny Bufu schrieb:
> On 28-May-07, at 5:55 AM, Claus Färber wrote:
>> Johnny Bufu schrieb:
>>> So I've rewritten the encoding section, such that:
>> This means that '%' characters need to be encoded up to three times:
> I'm not sure I follow your reasoning all the way; please see my comments below and point where I'm wrong.

>> For example:
>> User name: 100%pure
>> Embedded in a URI that is the value of the attribute:
>>   http://example.com/foo/100%25pure
> This encoding happens outside of the OpenID / AX protocols.

Yes. It's just for illustration. But yes, I counted that as the first encoding. However, two of the three encodings happen in AX and OpenID. Further, I should have mentioned one more step here:

Encoded as an AX value:
  openid.ax.foo.uri:http://example.com/foo/100%2525pure

>> Encoded for AX using Key-Value Form Encoding (OID 2, 4.1.1.)
>>   openid.ax.foo.uri:http://example.com/foo/100%2525pure
> AX has nothing to do directly with key-value encoding. I see no reference to percent-encoding from OpenID2's section 4.1.1. But yes, using the AX 3.3.1 Default Encoding of a String Value [1], if user_name=100%pure the field in a key-value representation would be: openid.ax.foo.value=100%25pure

This looks wrong. In Key-Value Form, it would be:
  ax.foo.value:100%25pure
(A colon, no "openid." prefix.) In HTTP Encoding, it would be:
  openid.ax.foo.value=100%2525pure
(First encoding from AX, second encoding from HTTP Encoding.)

>> Encoded for AX using HTTP Encoding (OID 2, 4.1.2.)
>>   openid.ax.foo.uri=http%3A//example.com/foo/100%2525pure

I got this wrong; it should be:
  openid.ax.foo.uri=http%3A//example.com/foo/100%252525pure

> Yes, there would be a double-encoding of the % char, one done by AX 3.3.1, and another x-www-form encoding as required by OpenID 4.1.2 for indirect messages.

(Plus the one by URI encoding.)

>> I don't think it's a good idea to introduce a solution to the \n problem in AX only. It should be part of the base spec (OpenID 2 Authentication).
> What do you see as pros / cons for each proposed solution?

AX is not the only OpenID extension that might need to encode \n characters. If other specifications need to encode \n characters, it is easier to write them if the base specification (OpenID 2.0 Authentication) provides the encoding. It is also less likely that writers of such specifications invent their own ad-hoc encoding (or miss the problem at all).

The same is true for binary data: if the OpenID 2.0 specification RECOMMENDs base64, it's less likely that authors of extension specs invent their own encoding (which might be incompatible with software that expects UTF-8 and/or produce larger messages in HTTP Encoding).

>> What about changing section 4.1.1. from:
>> | A message in Key-Value form is a sequence of lines. Each line begins with a key, followed by a colon, and the value associated with the key. The line is terminated by a single newline (UCS codepoint 10, "\n"). A key or value MUST NOT contain a newline and a key also MUST NOT contain a colon.
>> to (wording adapted from RFC 2822):
>> | A message in Key-Value form consists of fields composed of a key, followed by a colon (":"), followed by a value, and terminated by a single LF (UCS codepoint 10, "\n"). The key MUST be composed of printable US-ASCII characters except ":" (i.e. characters that have values between 33 and 57, or between 59 and 126, inclusive). The key MUST NOT start with a '*' (codepoint 42). The value MUST be composed of a sequence of characters encoded as UTF-8.
>> | If an extension to this specification allows values that contain LF (UCS codepoint 10, "\n") characters, these LF characters MUST be encoded as the sequence LF, '*', ':' (UCS codepoints 10, 42, 58, "\n*:").
>> [Unlike the suggested %-encoding, this encoding is compatible with the current spec as long as LF characters are not actually allowed within the value.
> What makes the proposed percent-encoding incompatible with the current OpenID spec?

You can't use it as an encoding for _all_ Key-Value-Form messages, including those already specified in the base specification, as it encodes the '%' character differently: openid.return_to=http://example.com/f%E4rber vs. openid.ax.foo.return_to=http://example.com/f%25E4rber. If you want to change the encoding in the base specification (which I want to do), it had better be identical for all characters except LF.

>> It's similar to the RFC 2822 folding mechanism but folding is only allowed (and mandated) where a LF is to be encoded. Further, the continuation line is compatible with the key-value format, using '*' as a pseudo key.]
>> If an extension to this specification needs to allows
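The stacked encodings under discussion can be reproduced mechanically. A sketch (quote() stands in for each layer's percent-encoding; URL and value taken from the example in this thread):

```python
from urllib.parse import quote

# Layer 1 (outside OpenID/AX): the raw user name embedded in a URI path.
uri = "http://example.com/foo/" + quote("100%pure", safe="")
assert uri == "http://example.com/foo/100%25pure"

# Layer 2: AX string-value encoding escapes '%' again.
ax_value = uri.replace("%", "%25")
assert ax_value == "http://example.com/foo/100%2525pure"

# Layer 3: x-www-form-urlencoded HTTP encoding for indirect messages
# ('/' left unescaped here to match the examples in this thread).
http_value = quote(ax_value, safe="/")
print(http_value)  # http%3A//example.com/foo/100%252525pure
```

Each layer turns every existing "%25" into "%2525", which is exactly the triple-encoding being objected to.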
Re: attribute exchange value encoding
Johnny Bufu schrieb:
> I believe the HTTP encoding [1] in the OpenID spec will take care of this part, i.e. before putting the OpenID + AX message on the wire, the OpenID layer has to HTTP-encode it.

Maybe Base 64 Encoding with URL and Filename Safe Alphabet (RFC 3548, section 4) should be used for efficiency: if 2 out of 64 characters need to be %-encoded, this increases the size by an average of 6.25 % (I'm ignoring the '=' as it only appears once). The total overhead of Base64 thus changes from 33.3 % to 41.7 %.

Claus
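The arithmetic behind these percentages (my working, assuming uniformly distributed base64 output):

```python
# '+' and '/' are the 2 of 64 base64 alphabet characters that need form
# encoding; each grows from 1 byte to 3 bytes ("%2B", "%2F"), i.e. +2 bytes.
escape_growth = (2 / 64) * 2            # 0.0625  ->  6.25 %

# Plain base64: 4 output characters per 3 input bytes.
base64_overhead = 4 / 3 - 1             # ~0.333  -> 33.3 %

# Base64 followed by form encoding of '+' and '/':
total_overhead = (4 / 3) * (1 + escape_growth) - 1
print(round(total_overhead * 100, 1))   # 41.7
```

With the URL-safe alphabet ('-' and '_' instead of '+' and '/'), the form-encoding step adds nothing, so the total stays at the plain 33.3 %.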
Re: attribute exchange value encoding
Johnny Bufu schrieb:
> The attribute metadata can be used to define attribute-specific encodings, which should deal with issues like this.

Ah, so the _usual_ way is that the metadata (can this be renamed to "datatype definition"? "metadata" is very misleading) defines the encoding. For binary data, it will be base64Binary or hexBinary as defined in XML Schema. Correct?

> The AX protocol has to stay simple (that was overwhelming feedback I've received at IIW). The base64 encoding is there as a convenience: if a number of OPs and RPs agree on an attribute type (the classical example being an avatar image) but don't want to go to the trouble of publishing metadata information,

In other words: the metadata is implicitly agreed upon by the parties involved. If they can agree on the meaning and the base format (integer, string, *binary, ...), they can also agree on an encoding (e.g. agree on base64Binary instead of *binary).

So I don't think AX needs means to flag base64 data. The parties involved should know when base64Binary or hexBinary is used from out-of-band information (metadata/datatype definition or mutual agreement). In other words, AX should just restrict values to UTF-8 strings and recommend base64Binary (or hexBinary) for datatypes (datatypes, not data!) that can't be represented as UTF-8 strings.

Claus
Re: Specifying identifier recycling
Nat Sakimura schrieb:
> 1) Storing many users' private key on the server in decryptable format is not very safe. In your proposal, it looks like that OP is going to hold the private key for each user in decryptable format. Considering that most large scale privacy leakage happens at the server side, I have got a feeling that such thing like private key in a shared location.

If you can't trust your OP to keep your secrets secret, there's nothing you can do about that. Of course, you would not use a key that's valid for anything else than OpenID.

It's also possible that the OP does not know the private key, by using two key pairs:

. pers_secret, pers_public (the identity)
. temp_secret, temp_public

The OpenID Provider only has the following:

. pers_public
. temp_secret, temp_public
. cert = sign(temp_public, with_key=pers_secret)

The _real_ private key, pers_secret, is kept by the user. If the server is compromised (or becomes rogue, trying to steal the identity), the user can still take his identity elsewhere by signing the temp_public key of another server.

Claus
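A toy sketch of the two-key-pair idea. SHA-256 of key||message stands in for a real asymmetric signature here (a real scheme such as RSA would let anyone verify with pers_public alone); all key material is made up. The point shown is only the data flow: a compromised server never learns pers_secret, so the user can re-delegate to a new server:

```python
import hashlib

# Toy stand-in for sign(message, with_key): SHA-256 over key || message.
# NOT a real signature scheme; it only illustrates who holds what.
def toy_sign(message: bytes, key: bytes) -> bytes:
    return hashlib.sha256(key + message).digest()

# The user generates the long-term identity pair and keeps pers_secret.
pers_secret = b"user-held identity secret"
pers_public = b"pers-public"

# Server A gets a temporary pair plus a delegation certificate.
temp_secret_a, temp_public_a = b"temp-secret-A", b"temp-public-A"
cert_a = toy_sign(temp_public_a, pers_secret)

# If server A is compromised, the attacker learns temp_secret_a and
# cert_a, but not pers_secret. The user simply certifies a fresh
# temporary key for server B, without server A's cooperation:
temp_secret_b, temp_public_b = b"temp-secret-B", b"temp-public-B"
cert_b = toy_sign(temp_public_b, pers_secret)

assert cert_b != cert_a  # a new delegation for the same identity
```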