Re: HTML discovery: SGML entities and charsets

2007-05-20 Thread Claus Färber
Peter Watkins schrieb:
 7.3.3 in draft 11 says
 
 The openid2.provider and openid2.local_id URLs MUST NOT include entities 
 other than &amp;, &lt;, &gt;, and &quot;. Other characters that would 
 not be valid in the HTML document or that cannot be represented in the 
 document's character encoding MUST be escaped using the percent-encoding 
 (%xx) mechanism described in [RFC3986] (Berners-Lee, T., "Uniform Resource 
 Identifiers (URI): Generic Syntax").

Please note that the draft is completely broken here:

It's unclear: The first sentence talks about entities, which can only 
refer to character entity references (HTML 4.01, 5.3.2). The second 
sentence mandates RFC 3986 encoding, which is plainly wrong because it 
changes the URI. It does not talk about numeric character references 
at all (which are _not_ entities, see HTML 4.01, 5.3.1), even though they 
are the only correct way to encode a URI that contains a ''' ('&#39;'/'&#x27;').

It's incompatible: An HTML editor, tool or filter may assume that 
changing any characters to entities is allowed, so it may change 
http://example.org?login=user@example.net to 
http://example.org?login=user&#64;example.net without changing the 
meaning. The spec breaks this assumption.

It's dangerous: It's there to allow RP implementations to use a quick and 
dirty regexp-based parser instead of a true HTML parser, which (a) may 
break with completely valid HTML documents (bad user experience) and (b) 
may circumvent security measures taken by the site owners.

 1) Why are the characters &, <, >, and " allowed to be represented with those
 SGML entities? Why not require them to be encoded per RFC 3986 as %26, %3C,
 %3E, and %22? 

The point of RFC 3986 encoding is that URL special chars lose their 
special meaning _within_ _the_ _URL_:

http://example.org/?foo=1&bar=2 contains two parameters: foo with the 
value 1 and bar with the value 2.
http://example.org/?foo=1%26bar=2 contains a _single_ parameter, foo, 
with the value 1&bar=2.
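The distinction is easy to check mechanically; a small Python sketch using the example.org URLs above:

```python
from urllib.parse import parse_qs, urlsplit

# A literal '&' separates parameters: two parameters.
two = urlsplit("http://example.org/?foo=1&bar=2")
print(parse_qs(two.query))  # {'foo': ['1'], 'bar': ['2']}

# A percent-encoded '&' (%26) is just data: one parameter.
one = urlsplit("http://example.org/?foo=1%26bar=2")
print(parse_qs(one.query))  # {'foo': ['1&bar=2']}
```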

The point of HTML encoding is that HTML special chars lose their special 
meaning _within_ _HTML_:

<a href="http://example.org/?x=1&copy=2"> is a link to the IRI
http://example.org/?x=1©=2, which is equivalent to the ASCII URI 
http://example.org/?x=1%C2%A9%3D2.

<a href="http://example.org/?x=1&amp;copy=2"> is a link to the URI
http://example.org/?x=1&copy=2

However, '<', '>' and '"' are not legal within URIs and IRIs anyway. Other 
characters with named entities are outside the ASCII range and thus 
illegal in URIs but not IRIs.
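Both layers can be demonstrated with Python's stdlib html module (a sketch; note how a conforming parser undoes exactly the HTML layer and nothing more):

```python
from html import escape, unescape

uri = "http://example.org/?x=1&copy=2"

# HTML layer: '&' must become '&amp;' inside an attribute value.
attr = escape(uri, quote=True)
print(attr)  # http://example.org/?x=1&amp;copy=2

# A conforming parser recovers the original URI...
assert unescape(attr) == uri

# ...whereas without the entity encoding it sees '&copy' as '©'
# (HTML recognizes some named entities even without the trailing ';'):
print(unescape(uri))  # http://example.org/?x=1©=2
```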

 2) Also, should 7.3.3 specify that, as with the key/value data pairs, these
 values be encoded in UTF-8? Requiring UTF-8 would free RP code from having
 to understand different HTML character sets, and would allow users to encode
 their HTML delivery pages in the charset of their choosing.

No, the whole HTML document must use the same character set.

However, unless you're using IRIs, you can usually get away with 
treating the document as ASCII; you'll have some characters with the 8th 
bit set but you can simply ignore them if you just want to extract URIs.

Problematic charsets include ISO-2022 (common), Shift-JIS (very common, 
though only '~' is a problem wrt URIs, as it cannot be encoded at all), 
UTF-16 (rare), UTF-32 (very rare), EBCDIC-based charsets (very rare) and 
national ISO-646 variants.

Claus

___
specs mailing list
specs@openid.net
http://openid.net/mailman/listinfo/specs


Re: HTML discovery: SGML entities and charsets

2007-05-28 Thread Claus Färber
Josh Hoyt schrieb:
 There has been a little discussion in the past about the restriction
 on allowed character entity references. I don't think there has been
 any about numeric character references, except in lumping them in with
 character entity references.
 
 These restrictions live on from the OpenID 1 specification, and were
 preserved primarily to ease backwards compatibility (IIRC).

It seems that it has been taken from the pingback specification:
http://www.hixie.ch/specs/pingback/pingback#TOC2.2

The rationale given is that it should not be necessary to implement a
full HTML parser. Unfortunately, this claim is completely bogus: as
HTML has a context-sensitive grammar, you simply cannot parse it with
regular expressions.

If you try, you will inevitably write a parser that fails on some HTML
constructs users might expect to work. (For example: comments. It may seem
unimaginable, but users might even put comment markers around an
OpenID link, add another OpenID link, and expect RPs to use the one not
within a comment.)
Others will also try, and will inevitably write a parser that fails on
some _different_ HTML constructs.

The result is that one RP might work with a URL (because it can handle
comments within the HTML) while another does not. Without looking at
the code of the RP's HTML parser, it is nearly impossible for the user
to tell why some RPs fail.
If that isn't extremely bad user experience, what is?

(As a side note: There's no telling whether there's a security risk with
some RPs, either.)

The only way around this is using a real HTML parser. If you do, there's
no reason not to parse and handle all character references.
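The comment pitfall is easy to reproduce; a Python sketch (hypothetical link values) comparing a naive regex against the stdlib HTML parser:

```python
import re
from html.parser import HTMLParser

html_doc = """<html><head>
<!-- <link rel="openid2.provider" href="http://old.example.com/"> -->
<link rel="openid2.provider" href="http://op.example.com/">
</head><body></body></html>"""

# Naive regex scan: matches the commented-out link first.
naive = re.search(r'<link rel="openid2.provider" href="([^"]*)"', html_doc)
print(naive.group(1))  # http://old.example.com/ -- wrong

# A real HTML parser never sees the comment's contents as markup.
class LinkFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.provider = None
    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "openid2.provider" and self.provider is None:
            self.provider = a.get("href")

finder = LinkFinder()
finder.feed(html_doc)
print(finder.provider)  # http://op.example.com/
```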

Claus



Re: Final outstanding issues with the OpenID 2.0 Authentication specification

2007-05-28 Thread Claus Färber
Marius Scurtescu schrieb:
 The new attribute values are needed in order to signal an OpenID 2  
 provider.

Why is this necessary? Is OpenID 2 incompatible? In other words, what 
happens if an OpenID 2 Relying Party tries to talk to an OpenID 1.x 
Provider?

If the OpenID 1.x Provider just ignores additional message fields (i.e. 
treats them like an unknown extension), then no new rel values are 
needed. If this is not the case, maybe the OID 2 spec can be changed to 
make it possible.

It is always better to detect features, not versions.

Claus



Re-defining the Key-Value format (was: attribute exchange value encoding)

2007-05-28 Thread Claus Färber
Johnny Bufu schrieb:
 So I've rewritten the encoding section, such that:
 
 - for strings, only the newline (and percent) characters are required  
 to be escaped,
(to comply with OpenID's data formats), using percent-encoding;

This means that '%' characters need to be encoded up to three times:

For example:

User name: 100%pure

Embedded in a URI that is the value of the attribute:
   http://example.com/foo/100%25pure

Encoded for AX using Key-Value Form Encoding  (OID 2, 4.1.1.)
   openid.ax.foo.uri:http://example.com/foo/100%2525pure

Encoded for AX using HTTP Encoding (OID 2, 4.1.2.)
   openid.ax.foo.uri=http%3A//example.com/foo/100%2525pure
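The stacking can be reproduced with Python's urllib; the middle step stands in for AX's proposed percent-escaping, and the last step for the x-www-form transport encoding:

```python
from urllib.parse import quote

value = "100%pure"  # the raw user name

uri = "http://example.com/foo/" + quote(value)  # URI-level percent-encoding
kv = quote(uri, safe=":/")                      # AX-level escaping of '%'
http = quote(kv, safe="/")                      # x-www-form transport encoding

print(uri)   # http://example.com/foo/100%25pure
print(kv)    # http://example.com/foo/100%2525pure
print(http)  # http%3A//example.com/foo/100%252525pure
```

(Mechanically, the transport layer escapes the '%' signs once more; the follow-up message in this thread arrives at the same value.)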

I don't think it's a good idea to introduce a solution to the \n 
problem in AX only. It should be part of the base spec (OpenID 2 
Authentication).

What about changing section 4.1.1. from:

 A message in Key-Value form is a sequence of lines.  Each
 line begins with a key, followed by a colon, and the value
 associated with the key.  The line is terminated by a
 single newline (UCS codepoint 10, \n). A key or value
 MUST NOT contain a newline and a key also MUST NOT contain
 a colon.

to (wording adapted from RFC 2822):

A message in Key-Value form consists of fields composed of
 a key, followed by a colon (:), followed by a value, and
 terminated by a single LF (UCS codepoint 10, \n).

 The key MUST be composed of printable US-ASCII characters except
 : (i.e. characters that have values between 33 and 57, or
 between 59 and 126, inclusive). The key MUST NOT start with
 a '*' (codepoint 42).

 The value MUST be composed of a sequence of characters encoded
 as UTF-8. If an extension to this specification allows values
 that contain LF (UCS codepoint 10, \n) characters, these LF
 (UCS codepoint 10, \n) characters MUST be encoded as a
 sequence of LF, '*', ':' (UCS codepoints 10, 42, 58, "\n*:").

[Unlike the suggested %-encoding, this encoding is compatible with
the current spec as long as LF characters are not actually allowed
within the value.
It's similar to the RFC 2822 folding mechanism but folding is only
allowed (and mandated) where a LF is to be encoded. Further, the
continuation line is compatible with the key-value format, using '*'
as a pseudo key.]

 If an extension to this specification needs to allow binary
 data in values, i.e. if it allows arbitrary bytes not to be
 interpreted as UTF-8 characters, it MAY use Base64 [reference]
 encoding for the specification of the format of that value.

[Note: Base64 is quite efficient when it comes to encoding the
message in HTTP Encoding (OID 2, 4.1.2.). Unencoded bytes would have
to use the %-encoding, roughly doubling the size. Unencoded bytes also
create problems if implementations assume they are UTF-8, e.g.
if Perl strings are used.]
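A minimal sketch of the proposed '\n*:' folding (not an official codec, just the rule above applied literally):

```python
def encode_value(value: str) -> str:
    # Proposed escaping: each LF in a value becomes the three-character
    # sequence LF '*' ':' so that the continuation line parses as a
    # field with the pseudo key '*'.
    return value.replace("\n", "\n*:")

def decode_value(encoded: str) -> str:
    return encoded.replace("\n*:", "\n")

assert decode_value(encode_value("line1\nline2")) == "line1\nline2"
assert encode_value("no newline") == "no newline"  # unchanged, as required
```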

 - base64 must be used for encoding binary data, and defined
an additional field for this:
   openid.ax.encoding.alias=base64

I think it's much simpler if the specification of the field value format 
just says UTF-8 or Base64 and if the same encoding is used for all 
actual values, even those that would not need any encoding.

Claus



Persistent Identifiers (was: Proposal for Recycling Identifiers in OpenID 2.0)

2007-05-28 Thread Claus Färber
Dmitry Shechtman schrieb:
 This is definitely an interesting proposal. However, it only attempts to
 solve the recycling problem, whereas canonical IDs would solve this and
 several more.

I think the best solution would be a Persistent Identifier. If the 
OpenID Provider returns a different Persistent Identifier, the Relying 
Party can assume the ID has been recycled.

There's no reason the Persistent Identifier should be part of the URI. It 
can just be returned by the OpenID Provider as a message parameter, 
which Relying Parties that are not interested can ignore. Or even better, 
it could only be used in Attribute Exchange.

If Persistent Identifiers are unique across OpenID providers, they can 
even be used to allow users to change their claimed identity. This can 
be achieved by using a public cryptographic key as the Persistent 
Identifier (well, semi-persistent actually):

A rough sketch:

   Login:

 On login, the RP tells the OP that it wants to use a Persistent
 Identifier and requests ALL keys for the user:

 RP => OP:
   openid.persid.version=0.1
   openid.persid.getkey=ALL

 The OP returns a list of the user's public keys, both current keys and
 obsolete previous keys (which the user can still authenticate with):

 OP => RP:
   openid.persid.key.0=DH:base 64...
   openid.persid.key.1=RSA:base 64...
   openid.persid.key.2.obsolete=RSA:base 64...
   openid.persid.key.3.obsolete=FOO:base 64...

 The RP has stored the key #2 as a persistent identifier for a user,
 so it asks the OP to authenticate with that key:

 RP => OP:
   openid.persid.challenge.0.key=RSA:base 64...
   openid.persid.challenge.0.id=120938231
   openid.persid.challenge.0.data=base 64...

 The OP presents the correct answer but wants the RP to update to the
 new persistent identifier:

 OP => RP:
   openid.persid.response.id=120938232
   openid.persid.response.data=RSA:base 64...

 The RP now accepts the user as being identical to the user which had
 key #2. It now stores key #1, which is the current RSA key, in its
 database for that user, possibly overwriting key #2. It might also
 update the OpenID identifier.

   Login with known key:

 Here, the RP already has a key for the claimed identifier, so it
 just sends it with the initial request:

 RP => OP:
   openid.persid.version=0.1
   openid.persid.getkey=CURRENT
   openid.persid.challenge.0.key=RSA:base 64...
   openid.persid.challenge.0.id=120938232
   openid.persid.challenge.0.data=base 64...

 The OP can now immediately return the correct answer along with a
 list of the user's CURRENT keys:

 OP => RP:
   openid.persid.response.0.id=120938231
   openid.persid.response.0.data=base 64...
   openid.persid.key.0=RSA:base 64...
   openid.persid.key.1=RSA:base 64...

 As above, the RP now accepts the user as being identical to the user
 which had key #2 and stores key #1, which is the current RSA key, in
 its database.
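For illustration only (the field names below are copied from the hypothetical sketch above, not from any spec), the opening messages could be assembled as plain key-value pairs:

```python
def persid_getkey_request(mode: str = "ALL") -> dict:
    # Initial RP => OP request asking for the user's keys.
    return {
        "openid.persid.version": "0.1",
        "openid.persid.getkey": mode,
    }

def persid_challenge(index: int, key: str, challenge_id: str, data_b64: str) -> dict:
    # RP => OP challenge against one stored public key.
    prefix = "openid.persid.challenge.%d" % index
    return {
        prefix + ".key": key,
        prefix + ".id": challenge_id,
        prefix + ".data": data_b64,
    }

# The "login with known key" opening message from the sketch:
msg = {**persid_getkey_request("CURRENT"),
       **persid_challenge(0, "RSA:base 64...", "120938232", "base 64...")}
```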

Claus



Re: HTML discovery: SGML entities and charsets

2007-05-28 Thread Claus Färber
Peter Watkins schrieb:
 I don't think it's reasonable to expect RP code to be capable of parsing
 every possible charset in which an HTML page might be encoded.
 
 I also don't think it's reasonable to specify specific charsets that RPs
 should be able to decode and then require OpenID users to use those charsets
 in their web pages just so RPs can parse these two link elements.
 
 I believe the contents of those two tags' HREF attributes should be defined
 as UTF-8 representations of the URLs, encoded per RFC 3986.

URIs are always confined to a small repertoire of characters, roughly a
subset of ASCII. They are made of characters, which may be represented in
ASCII, UTF-32, EBCDIC, ink on paper, etc., not of bytes (or coded characters).

Non-ASCII characters (and special characters) are not a concern when
embedding finished URIs in HTML documents; like percent-encoding, they
are only a concern when making the URIs (see below).

Actually, just two characters may need to be encoded when a URI is
embedded in an HTML document: '&' and ''' (the latter only if the
attribute is, for some inexplicable reason, using ' as quotes).
Only '&' has a named entity: '&amp;'. All the others defined in HTML are 
either above U+007E or specials not allowed within URIs.
However, any other character *may* be encoded. For example, '@' might
be encoded as '&#x40;' and 'A' might be encoded as '&#65;'.

Actually, handling different legacy charsets is very easy: if they're
an extended ASCII charset, just don't try to interpret characters with
the 8th bit set.
That does not work with UTF-16, UTF-32, ISO 2022 and EBCDIC, however.

So the two questions to answer here are:

. What charsets does a RP need to be able to handle?
   - extended ASCII (including UTF-8, ISO 8859, GB 18030)
   - UTF-16 (including endian detection)
   - UTF-32?
   - ISO 2022 (a switching charset that might fool ASCII parsers;
   any sequence outside the ASCII plane can be ignored just like
   8-bit chars with extended ASCII)?
   - EBCDIC?

. What character references does a RP need to handle?
   - entity references (i.e. '&amp;')
   - numeric character references ('&#xNN;' and '&#NNN;')

 A link in a big5 HTML document to an internationalized URL 
 may not be decipherable by my web browser, and that's normally OK because
 an internationalized Chinese URL in a Chinese-language document is probably 
 nothing I could read, anyway. HTML is designed for human communication.

Well, if we're talking about IRIs (Internationalised Resource
Identifiers), that's a completely different story.

Like URIs, they are made of characters. However, these characters may
now be above U+007E.
When embedding them in HTML, there are many additional named entity
references to handle.
Further, you can't get away with just handling extended ASCII as ASCII.

URIs can be mapped to IRIs by undoing the percent-encoding for bytes
that are valid UTF-8 sequences and interpreting the result as UTF-8.

For example, http://example.com/f%C3%A4rber can be mapped to
http://example.com/färber.
However, http://example.com/f%E4rber cannot be mapped to an IRI (i.e.
the IRI is just identical to the URI).
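The mapping rule can be sketched in Python (simplified: it naively decodes all percent-escapes, ignoring the rule that reserved characters must stay encoded):

```python
from urllib.parse import unquote_to_bytes

def uri_to_iri(uri: str) -> str:
    # Undo the percent-encoding; if the resulting bytes are valid UTF-8,
    # the URI maps to an IRI, otherwise the IRI is the URI itself.
    try:
        return unquote_to_bytes(uri).decode("utf-8")
    except UnicodeDecodeError:
        return uri

print(uri_to_iri("http://example.com/f%C3%A4rber"))  # http://example.com/färber
print(uri_to_iri("http://example.com/f%E4rber"))     # unchanged: not valid UTF-8
```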

Currently, the HTML 4.01 spec does not formally allow IRIs. However, the
HTML 5 draft does.

With all of this, the real question here is:

. Should support for IRIs be required?

If IRIs are allowed, the number of charsets and named entity references an
RP must be able to handle is much larger. So if yes, the same questions
as above come up again:

. What charset does a RP need to be able to handle?
   - ISO 8859-X, Windows-1252?
   - UTF-8
   - GB 18030
   - EUC
   - UTF-16 (including endian detection)
   - UTF-32?
   - ...

. What character references does a RP need to handle?
   - entity references (full HTML list)
   - numeric character references ('&#xNN;' and '&#NNN;')

 Instead of thinking of the OpenID2 values as text, think of them as
 binary data that a machine needs to read. If an internationalized Chinese URL
 is converted to UTF-8 bytes and then URI-encoded, it is then reduced to
 lowest-common-denominator text: US-ASCII.

That's basically what URIs already do. No need to reinvent the wheel.

 Consider an identity URL like http://www.färber.de/claus
 
 In UTF-8, ä is represented by bytes 0xC3 and 0xA4, so a RFC3986 encoded 
 UTF-8 representation of http://www.färber.de/claus would be
   http://www.f%C3%A4rber.de/claus

Or just http://www.xn--frber-gra.de/claus, which also works with
software that can't handle IDNs at all.

It does not work like that with the path component of HTTP URIs,
however. http://example.com/f%E4rber (using ISO 8859-1),
http://example.com/f%7Brber (ISO 646 DE) and
http://example.com/f%C3%A4rber (UTF-8) are all valid URIs.

As a general rule, the data within URIs consists of bytes (possibly
percent-encoded), not characters. The mapping between these bytes and
characters can be made
by the URI specification (e.g. domain names), by the server that hosts
the resource (e.g. a Windows 

Re: Re-defining the Key-Value format

2007-05-28 Thread Claus Färber
Johnny Bufu schrieb:
 On 28-May-07, at 5:55 AM, Claus Färber wrote:
 Johnny Bufu schrieb:
 So I've rewritten the encoding section, such that:
 This means that '%' characters need to be encoded up to three times:
 I'm not sure I follow your reasoning all the way; please see my  
 comments below and point where I'm wrong.
 
 For example:

 User name: 100%pure

 Embedded in an URI that is the value of the attribute:
http://example.com/foo/100%25pure
 
 This encoding happens outside of the OpenID / AX protocols.

Yes, it's just for illustration, but I counted it as the first 
encoding. However, two of the three encodings happen in AX and OpenID.

Further, I should have mentioned one more step here:

Encoded as an AX value:
 openid.ax.foo.uri:http://example.com/foo/100%2525pure

 Encoded for AX using Key-Value Form Encoding  (OID 2, 4.1.1.)
openid.ax.foo.uri:http://example.com/foo/100%2525pure
 
 AX has nothing to do directly with key-value encoding. I see no  
 reference to percent-encoding from OpenID2's section 4.1.1.

 But yes, using the AX 3.3.1 Default Encoding of a String Value [1],  
 if user_name=100%pure the field in an key-value representation would be:
 
   openid.ax.foo.value=100%25pure

This looks wrong. In Key-Value Form, it would be:

 ax.foo.value:100%25pure

(A colon, no openid. prefix.)

In HTTP Encoding, it would be:

 openid.ax.foo.value=100%2525pure

(First encoding from AX, second encoding from HTTP Encoding.)

 Encoded for AX using HTTP Encoding (OID 2, 4.1.2.)
openid.ax.foo.uri=http%3A//example.com/foo/100%2525pure

I got this wrong, it should be:
 openid.ax.foo.uri=http%3A//example.com/foo/100%252525pure

 Yes, there would be a double-encoding of the % char, one done by AX  
 3.3.1, and another x-www-form encoding as required by OpenID 4.1.2  
 for indirect messages.

(plus the one by URI encoding.)

 I don't think it's a good idea to introduce a solution to the \n
 problem in AX only. It should be part of the base spec (OpenId 2
 Authentication).
 
 What do you see as pros / cons for each proposed solution?

AX is not the only OpenID extension that might need to encode \n 
characters.

If other specifications need to encode \n characters, it is easier to 
write them if the base specification (OpenID 2.0 
Authentication) provides the encoding. It is also less likely that 
writers of such specifications invent their own ad-hoc encodings (or miss 
the problem altogether).

The same is true for binary data: If the OpenID 2.0 specification 
RECOMMENDs base64, it's less likely that authors of extension specs 
invent their own encodings (which might be incompatible with software 
that expects UTF-8, and/or produce larger messages in HTTP Encoding).

 What about changing section 4.1.1. from:

  A message in Key-Value form is a sequence of lines.  Each
  line begins with a key, followed by a colon, and the value
  associated with the key.  The line is terminated by a
  single newline (UCS codepoint 10, \n). A key or value
  MUST NOT contain a newline and a key also MUST NOT contain
  a colon.

 to (wording adapted from RFC 2822):

  A message in Key-Value form consists of fields composed of
  a key, followed by a colon (:), followed by a value, and
  terminated by a single LF (UCS codepoint 10, \n).

  The key MUST be composed of printable US-ASCII characters  
 except
  : (i.e. characters that have values between 33 and 57, or
  between 59 and 126, inclusive). The key MUST NOT start with
  a '*' (codepoint 42).

  The value MUST be composed of a sequence of characters  
 encoded
  as UTF-8. If an extension to this specification allows values
  that contain LF (UCS codepoint 10, \n) characters, these LF
  (UCS codepoint 10, \n) characters MUST be encoded as a
  sequence of LF, '*', ':' (UCS codepoints 10, 42, 58,
 "\n*:").

 [Unlike the suggested %-encoding, this encoding is compatible with
 the current spec as long as LF characters are not actually allowed
 within the value.
 
 What makes the proposed percent-encoding incompatible with the  
 current OpenID spec?

You can't use it as an encoding for _all_ Key-Value-Form messages, 
including those already specified in the base specification, as it 
encodes the '%' character differently:
   openid.return_to=http://example.com/f%E4rber
vs.
   openid.ax.foo.return_to=http://example.com/f%25E4rber.

If you want to change the encoding in the base specification (which I 
want to do), it better be identical for all characters except LF.

 It's similar to the RFC 2822 folding mechanism but folding is only
 allowed (and mandated) where a LF is to be encoded. Further, the
 continuation line is compatible with the key-value format,  
 using '*'
 as a pseudo key value.]

 If an extension to this specification needs to allow

Re: attribute exchange value encoding

2007-05-29 Thread Claus Färber
Johnny Bufu schrieb:
 I believe the HTTP encoding [1] in the OpenID spec will take care of  
 this part, i.e. before putting the OpenID + AX message on the wire,  
 the OpenID layer has to HTTP-encode it.

Maybe Base 64 Encoding with URL and Filename Safe Alphabet (RFC 3548, 
section 4) should be used for efficiency.

If 2 out of 64 characters need to be %-encoded, this increases the size 
by an average of 6.25%. (I'm ignoring the '=' as it only appears once.) 
The total overhead of Base64 changes from 33.3% to 41.7%.
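The effect is easy to measure (a Python sketch using a worst-case binary sample):

```python
import base64
from urllib.parse import quote

raw = bytes(range(256))  # worst-case binary sample

std = base64.b64encode(raw).decode("ascii")          # alphabet includes '+' and '/'
url = base64.urlsafe_b64encode(raw).decode("ascii")  # uses '-' and '_' instead

# After form-encoding, '+' and '/' balloon to %2B and %2F,
# while '-' and '_' are unreserved and pass through untouched.
overhead_std = len(quote(std, safe="")) / len(std)
overhead_url = len(quote(url, safe="")) / len(url)
assert overhead_url < overhead_std
```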

Claus



Re: attribute exchange value encoding

2007-05-29 Thread Claus Färber
Johnny Bufu schrieb:
 The attribute metadata can be used to define attribute-specific  
 encodings, which should deal with issues like this.

Ah, so the _usual_ way is that the metadata (can this be renamed to 
"datatype definition"? "metadata" is very misleading) defines the 
encoding. For binary data, it will be base64Binary or hexBinary as 
defined in XML Schema. Correct?

 The AX protocol has to stay simple (that was overwhelming feedback  
 I've received at IIW). The base64 encoding is there as a convenience:  
 if a number of OPs and RPs agree on an attribute type (the classical  
 example being an avatar image) but don't want to go to the trouble of  
 publishing metadata information,

In other words: The metadata is implicitly agreed upon by the parties 
involved. If they can agree on the meaning and the base format (integer, 
string, *binary,...) they can also agree on an encoding (e.g. agree on 
base64Binary instead of *binary).

So I don't think AX needs a means to flag base64 data. The parties 
involved should know when base64Binary or hexBinary is used from 
out-of-band information (metadata/datatype definition or mutual agreement).

In other words, AX should just restrict values to UTF-8 strings and 
recommend base64Binary (or hexBinary) for datatypes (datatypes, not 
data!) that can't be represented as UTF-8 strings.

Claus



Re: Specifying identifier recycling

2007-06-02 Thread Claus Färber
Nat Sakimura schrieb:
 1) Storing many users' private key on the server in decryptable format is
 not very safe. 
 
 In your proposal, it looks like that OP is going to hold the private key for
 each user in decryptable format. Considering that most large scale privacy
 leakage happens at the server side, I have got a feeling that such thing
 like private key in a shared location.

If you can't trust your OP to keep your secrets secret, there's nothing 
you can do about that. Of course, you would not use a key that's also 
valid for anything other than OpenID.

It's also possible to keep the OP from knowing the private key, by using 
two key pairs:

. pers_secret, pers_public (the identity)
. temp_secret, temp_public

The OpenID Provider only has the following:

. pers_public
. temp_secret, temp_public
. cert = sign(temp_public, with_key=pers_secret)

The _real_ private key, pers_secret, is kept by the user. If the server 
is compromised (or goes rogue, trying to steal the identity), the 
user can still take his identity elsewhere by signing the temp_public 
key of another server.
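A sketch of who holds what in this scheme (placeholder strings stand in for real key material and signatures; nothing here does actual cryptography):

```python
# The user keeps the identity key pair; pers_secret never leaves the user.
user_holds = {
    "pers_secret": "<private identity key, kept by the user>",
    "pers_public": "<public identity key>",
}

def op_state(pers_public: str, temp_secret: str, temp_public: str, cert: str) -> dict:
    # The OP stores only the temporary pair plus a certificate,
    # cert = sign(temp_public, with_key=pers_secret), produced by the user.
    return {
        "pers_public": pers_public,
        "temp_secret": temp_secret,
        "temp_public": temp_public,
        "cert": cert,
    }

op = op_state("pk-pers", "sk-temp", "pk-temp", "sig(pk-temp, sk-pers)")
assert "pers_secret" not in op  # compromising the OP cannot leak the identity key
```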

Claus
