Re: HTML discovery: SGML entities and charsets

2007-05-28 Thread Claus Färber
Peter Watkins schrieb:
> I don't think it's reasonable to expect RP code to be capable of parsing
> every possible charset in which an HTML page might be encoded.
> 
> I also don't think it's reasonable to specify specific charsets that RPs
> should be able to decode and then require OpenID users to use those charsets
> in their web pages just so RPs can parse these two <link> elements.
> 
> I believe the contents of those two tags' HREF attributes should be defined
> as UTF-8 representations of the URLs, encoded per RFC 3986.

URIs are always confined to a small repertoire of characters, roughly a
subset of ASCII. They are made of characters, which may be represented in
ASCII, UTF-32, EBCDIC, ink on paper, etc., not of bytes (or coded characters).

Non-ASCII characters (or special characters) are not a concern when
embedding finished URIs in HTML documents. They are only a concern when
the URIs are made in the first place (see below), and the same is true
of percent-encoding.

Actually, just two characters may need to be encoded when a URI is
embedded in an HTML document: "&" and "'" (the latter only if the
attribute is, for some inexplicable reason, quoted with "'").
Only '&' has a named entity: '&amp;'. All the others defined in HTML are
either above U+007E or specials not allowed within URIs.
However, any other character *may* be encoded. For example, '@' might
be encoded as '&#64;' and 'A' might be encoded as '&#65;'.

Actually, handling different legacy charsets is very easy: if it's
an extended-ASCII charset, just don't try to interpret a character with
a set 8th bit.
That does not work with UTF-16, UTF-32, ISO 2022 and EBCDIC, however.
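A minimal Python sketch of this extended-ASCII shortcut (the sample bytes and markup are hypothetical, for illustration only):

```python
# Extended-ASCII trick: decode with latin-1, which maps every byte to a
# character and thus never fails, then ignore anything above U+007F when
# scanning for URI characters. The trailing bytes simulate 8-bit text.
page = b'<link rel="openid2.provider" href="http://example.org/op">\xfc\xdf'
text = page.decode('latin-1')            # 8-bit chars pass through harmlessly
ascii_only = ''.join(ch for ch in text if ord(ch) < 0x80)
# The <link> markup survives intact because URIs are ASCII-only.
```

As noted above, this breaks down for UTF-16, UTF-32, ISO 2022 and EBCDIC, where ASCII bytes do not necessarily mean ASCII characters.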

So the two questions to answer here are:

. What charsets does an RP need to be able to handle?
   - extended ASCII (including UTF-8, ISO 8859, GB 18030)
   - UTF-16 (including endian detection)
   - UTF-32?
   - ISO 2022 (a switching charset that might fool ASCII parsers;
   escape sequences outside the ASCII repertoire can be ignored just
   like 8-bit chars with extended ASCII)?
   - EBCDIC?

. What character references does an RP need to handle?
   - entity references (i.e. '&amp;')
   - numeric character references ('&#xNN;' and '&#NNN;')

> A link in a big5 HTML document to an internationalized URL 
> may not be decipherable by my web browser, and that's normally OK because
> an internationalized Chinese URL in a Chinese-language document is probably 
> nothing I could read, anyway. HTML is designed for human communication.

Well, if we're talking about IRIs (Internationalised Resource
Identifiers), that's a completely different story.

Like URIs, they are made of characters. However, these characters may
now be above U+007E.
When embedding them in HTML, there are a lot of additional named entity
references to consider.
Further, you can't get away with just handling extended ASCII as ASCII.

URIs can be mapped to IRIs by undoing the percent-encoding for bytes
that are valid UTF-8 sequences and interpreting the result as UTF-8.

For example, <http://www.example.org/f%C3%A4rber> can be mapped to
<http://www.example.org/färber>.
However, <http://www.example.org/f%E4rber> can not be mapped to an IRI (i.e.
the IRI is just identical to the URI).

Currently, the HTML 4.01 spec does not formally allow IRIs. However, the
HTML 5 draft does.
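A minimal sketch of this URI-to-IRI mapping in Python (the helper name is mine; it conservatively leaves escapes alone when the decoded run contains ASCII, since those may be percent-encoded reserved characters):

```python
import re

def uri_to_iri(uri):
    """Map a URI to an IRI by undoing percent-encoding for byte runs
    that form valid UTF-8; other escapes are left untouched."""
    def decode_run(match):
        raw = bytes.fromhex(match.group(0).replace('%', ''))
        try:
            text = raw.decode('utf-8')
        except UnicodeDecodeError:
            return match.group(0)          # not valid UTF-8: keep as-is
        if any(ord(ch) < 0x80 for ch in text):
            return match.group(0)          # keep reserved/ASCII escapes
        return text
    return re.sub(r'(?:%[0-9A-Fa-f]{2})+', decode_run, uri)
```

For instance, a "%C3%A4" run decodes to "ä", while a lone "%E4" (not valid UTF-8) stays percent-encoded, matching the rule described above.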

With all of this, the real question here is:

. Should support for IRIs be required?

If IRIs are allowed, the number of charsets and named entity references an
RP must be able to handle is much larger. So if yes, the same questions
as above come up again:

. What charsets does an RP need to be able to handle?
   - ISO 8859-X, Windows-1252?
   - UTF-8
   - GB 18030
   - EUC
   - UTF-16 (including endian detection)
   - UTF-32?
   - ...

. What character references does an RP need to handle?
   - entity references (full HTML list)
   - numeric character references ('&#xNN;' and '&#NNN;')

> Instead of thinking of the OpenID2 values as text, think of them as
> binary data that a machine needs to read. If an internationalized Chinese URL
> is converted to UTF-8 bytes and then URI-encoded, it is then reduced to
> lowest-common-denominator text: US-ASCII.

That's basically what URIs already do. No need to reinvent the wheel.

> Consider an identity URL like http://www.färber.de/claus
> 
> In UTF-8, "ä" is represented by bytes 0xC3 and 0xA4, so a RFC3986 encoded 
> UTF-8 representation of http://www.färber.de/claus would be
>   http://www.f%C3%A4rber.de/claus

Or just <http://www.xn--frber-gra.de/claus>, which also works with
software that can't handle IDNs at all.

It does not work like that with the path component of HTTP URIs,
however. <http://www.example.org/f%E4rber> (using ISO 8859-1),
<http://www.example.org/f%7Brber> (ISO 646 DE) and
<http://www.example.org/f%C3%A4rber> (UTF-8) are all valid URIs.
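For the host name, the IDNA (Punycode) conversion gives the ASCII-compatible form directly; a sketch using Python's built-in idna codec (IDNA 2003):

```python
# Convert an IDN host name to its ASCII-compatible (Punycode) form,
# which software without IDN support can still resolve.
host = 'www.färber.de'
ace = host.encode('idna')   # b'www.xn--frber-gra.de'
```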

As a general rule, URIs contain bytes (possibly percent-encoded), not
characters. The mapping between these bytes and characters can be made
by the URI specification (e.g. for domain names), by the server that
minted the URI, or not at all.

Re: HTML discovery: SGML entities and charsets

2007-05-28 Thread Julian Reschke
Peter Watkins wrote:
> 7.3.3 in draft 11 says
> 
> The "openid2.provider" and "openid2.local_id" URLs MUST NOT include entities 
> other than "&", "<", ">", and """. Other characters that would 
> not be valid in the HTML document or that cannot be represented in the 
> document's character encoding MUST be escaped using the percent-encoding 
> (%xx) mechanism described in [RFC3986] (Berners-Lee, T., .Uniform Resource 
> Identifiers (URI): Generic Syntax,. .).
> 
> Questions:
> 
> 1) Why are the characters &, <, >, and " allowed to be represented with those
> SGML entities? Why not require them to be encoded per RFC 3986 as %26, %3C,
> %3E, and %22? 

"<" and ">" are not allowed in URLs anyway. An ampersand can appear in a 
URL, in which case it would have different semantics than %26.

> 2) Also, should 7.3.3 specify that, as with the key/value data pairs, these
> values be encoded in UTF-8? Requiring UTF-8 would free RP code from having
> to understand different HTML character sets, and would allow users to encode
> their HTML delivery pages in the charset of their choosing. As it stands, 
> it appears that the HTML document containing the LINK tags could be encoded 
> in any charset, with the RP responsible for decoding. With the existence 
> of "internationallized" domain names, it's quite possible that the provider 
> and local_id values will contain non-ASCII characters. Specifying UTF-8 
> encoding for HTML discovery will allow leaner, more reliable RP code.

The value of the href attribute of an HTML link is a URI, and URIs do 
not contain non-ASCII characters by definition.

Best regards, Julian

___
specs mailing list
specs@openid.net
http://openid.net/mailman/listinfo/specs


Re: HTML discovery: SGML entities and charsets

2007-05-28 Thread Julian Reschke
Peter Watkins wrote:
> I believe the contents of those two tags' HREF attributes should be defined
> as UTF-8 representations of the URLs, encoded per RFC 3986.

What is an "UTF-8" representation of a URL? A URL never ever contains 
non-ASCII characters, by definition.

> But we're not talking about "text" here, and there's no expectation that the
> RP should be able to "read" the text in the HTML document at the user's 
> claimed
> identity. Instead of thinking of the OpenID2 values as text, think of them as
> binary data that a machine needs to read. If an internationalized Chinese URL
> is converted to UTF-8 bytes and then URI-encoded, it is then reduced to 
> lowest-
> common-denominator text: US-ASCII. It's an easy matter for the RP to extract 
> that and convert it back to a Unicode string, and process it properly.
> 
> Consider an identity URL like http://www.färber.de/claus

That's an IRI, not a URI or URL.

> In UTF-8, "ä" is represented by bytes 0xC3 and 0xA4, so a RFC3986 encoded 
> UTF-8 representation of http://www.färber.de/claus would be
>   http://www.f%C3%A4rber.de/claus

Nope. You can't have "a umlaut" in a URI. You can have it in an IRI, in 
which case RFC 3987 describes the transformation to a URI. In this case, 
the result will be different from your example, as the non-ASCII 
character appears in the host name, for which different escaping rules 
apply.

> ...

Best regards, Julian


Re: HTML discovery: SGML entities and charsets

2007-05-28 Thread Claus Färber
Josh Hoyt schrieb:
> There has been a little discussion in the past about the restriction
> on allowed character entity references. I don't think there has been
> any about numeric character references, except in lumping them in with
> character entity references.
> 
> These restrictions live on from the OpenID 1 specification, and were
> preserved primarily to ease backwards compatibility (IIRC).

It seems that it has been taken from the pingback specification:
http://www.hixie.ch/specs/pingback/pingback#TOC2.2

The rationale given is that it should not be necessary to implement a
full HTML parser. Unfortunately, this claim is completely bogus: as
HTML has a context-sensitive grammar, you just can't parse it with
regular expressions.

If you try, you will inevitably write a parser that trips over some HTML
constructs users might expect to work. (For example: comments. It is
hardly unimaginable that users might put comment markers around an
OpenID link, add another OpenID link, and expect RPs to use the one not
within a comment.)
Others will also try, and will inevitably write parsers that trip over
some _different_ HTML constructs.

The result is that one RP might work with a URL (because it can handle
comments within the HTML) while another one does not. Without looking at
the code of the RP's HTML parser, it is nearly impossible for the user
to tell why some RPs fail.
If that isn't an extremely bad user experience, what is?

(As a side note: There's no telling whether there's a security risk with
some RPs, either.)

The only way around this is using a real HTML parser. If you do, there's
no reason not to parse and handle all character references.
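As an illustration (a Python sketch, not part of any spec; the class name and test markup are mine), a real parser handles comments and all character references for free:

```python
from html.parser import HTMLParser

class OpenIDLinkParser(HTMLParser):
    """Collect openid2.* <link> elements using a real HTML parser."""
    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode all char references
        self.links = {}

    def handle_starttag(self, tag, attrs):
        # Start tags inside comments never reach this method.
        if tag == 'link':
            a = dict(attrs)
            if a.get('rel') in ('openid2.provider', 'openid2.local_id'):
                self.links[a['rel']] = a.get('href')

p = OpenIDLinkParser()
p.feed('<!-- <link rel="openid2.provider" href="http://example.org/old"> -->\n'
       '<link rel="openid2.provider" href="http://example.org/op?a=1&amp;b=2">')
# Only the uncommented link is kept, with "&amp;" decoded to "&".
```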

Claus



RE: HTML discovery: SGML entities and charsets

2007-05-23 Thread Drummond Reed
>Peter Watkins wrote:
>

>
>My concrete suggestion: replace the current language
>
>Other characters that would not be valid in the HTML document or that
cannot be represented in the document's character encoding MUST be escaped
using the percent-encoding (%xx) mechanism described in [RFC3986].
>
>with this:
>
>Any character in the href attributes MAY be represented as UTF-8 data
escaped using the percent-encoding (%xx) mechanism described in [RFC3986].
Characters with Unicode values greater than U+007E MUST be represented as
UTF-8 data escaped using the percent-encoding (%xx) mechanism described in
[RFC3986]. For instance, the character "ä" (umlaut a, Unicode U+00E4) MUST be
represented as a six-character string like "%C3%A4" as suggested by RFC 2718.


Peter, I agree UTF-8 encoding before percent-encoding must be specified, as
otherwise you don't know how to interpret the percent-encoded characters.
However, since RFC 3987 (the IRI spec) already specifies UTF-8 encoding
before percent-encoding, couldn't we just specify it by reference to both
RFC 3986 and 3987, e.g.:

Any character in the href attributes MUST be a valid URI character as
specified by [RFC3986]. If any character outside the valid URI character set
is included, it MUST be encoded using the percent-encoding (%xx) mechanism
defined in section 2.1 of [RFC3986] after first being UTF-8 encoded as
specified in [RFC3987]. For instance, the character "ä" (umlaut a, Unicode
U+00E4) MUST be represented as a six-character string like "%C3%A4" as
suggested by RFC 2718.
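This UTF-8-first, then percent-encode rule is what, for example, Python's urllib.parse.quote already implements; a sketch (the safe-character set shown is my assumption, for illustration):

```python
from urllib.parse import quote

# quote() encodes the string as UTF-8 and then percent-escapes it,
# matching the RFC 3987 UTF-8-then-%xx convention:
print(quote('ä'))                                     # %C3%A4
print(quote('http://www.färber.de/claus', safe=':/'))
# http://www.f%C3%A4rber.de/claus
```

Note that, as Julian points out elsewhere in the thread, this naive escaping is not correct for the host component, where IDNA rules apply instead.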

=Drummond 



Re: HTML discovery: SGML entities and charsets

2007-05-23 Thread Peter Watkins
On Mon, May 21, 2007 at 11:50:32AM -0700, Josh Hoyt wrote:

> On 5/20/07, Claus Färber <[EMAIL PROTECTED]> wrote:
> > Peter Watkins schrieb:
> > > 7.3.3 in draft 11 says
> > >
> > > The "openid2.provider" and "openid2.local_id" URLs MUST NOT include 
> > > entities other than "&", "<", ">", and """. Other 
> > > characters that would not be valid in the HTML document or that cannot be 
> > > represented in the document's character encoding MUST be escaped using 
> > > the percent-encoding (%xx) mechanism described in [RFC3986] (Berners-Lee, 
> > > T., .Uniform Resource Identifiers (URI): Generic Syntax,. .).
> >
> > Please note that the draft is completely broken here:
> 
> Can you suggest improvements and examples or test cases of how you
> think it should work?
> 
> There has been a little discussion in the past about the restriction
> on allowed character entity references. I don't think there has been
> any about numeric character references, except in lumping them in with
> character entity references.
> 
> These restrictions live on from the OpenID 1 specification, and were
> preserved primarily to ease backwards compatibility (IIRC).

I don't think it's reasonable to expect RP code to be capable of parsing
every possible charset in which an HTML page might be encoded.

I also don't think it's reasonable to specify specific charsets that RPs
should be able to decode and then require OpenID users to use those charsets
in their web pages just so RPs can parse these two <link> elements.

I believe the contents of those two tags' HREF attributes should be defined
as UTF-8 representations of the URLs, encoded per RFC 3986.

As Claus has pointed out, this is NOT a normal way of embedding *text*
within an SGML/HTML document. That's true. Normally the HTML document would
contain text either in the appropriate bytes for the page's charset or in SGML 
entity representations. That generally works for HTML because HTML is
designed to be read by humans. If my browser doesn't understand the "big5"
charset used for Chinese text, that's normally OK because I cannot read
Chinese. A link in a big5 HTML document to an internationalized URL 
may not be decipherable by my web browser, and that's normally OK because
an internationalized Chinese URL in a Chinese-language document is probably 
nothing I could read, anyway. HTML is designed for human communication.

But we're not talking about "text" here, and there's no expectation that the
RP should be able to "read" the text in the HTML document at the user's claimed
identity. Instead of thinking of the OpenID2 values as text, think of them as
binary data that a machine needs to read. If an internationalized Chinese URL
is converted to UTF-8 bytes and then URI-encoded, it is then reduced to
lowest-common-denominator text: US-ASCII. It's an easy matter for the RP to extract 
that and convert it back to a Unicode string, and process it properly.

Consider an identity URL like http://www.färber.de/claus

In UTF-8, "ä" is represented by bytes 0xC3 and 0xA4, so a RFC3986 encoded 
UTF-8 representation of http://www.färber.de/claus would be
  http://www.f%C3%A4rber.de/claus
If the OpenID 2.0 spec made it clear that the value of these HTML discovery
attributes was to be decoded by
 1st: applying RFC3986 decoding to convert %NN values to bytes
 2nd: interpreting as UTF-8
then the string "http://www.f%C3%A4rber.de/claus" is not ambiguous at all.
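The two-step decoding described above is exactly what, e.g., Python's urllib.parse.unquote does (it percent-decodes to bytes, then interprets them as UTF-8 by default):

```python
from urllib.parse import unquote

# Step 1: %NN -> bytes; step 2: bytes -> text as UTF-8 (the default).
url = unquote('http://www.f%C3%A4rber.de/claus')
print(url)   # http://www.färber.de/claus
```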

This is a compromise -- making decoding simpler for RPs and allowing simple
"straight" URLs for common ASCII-only URLs like "https://www.faerber.de/claus";.
It also seems to be in accord with the W3C's stance on Internationalized
Resource Identifiers: http://www.w3.org/International/O-URL-and-ident.html

"URIs

"Internationalization of URIs is important because URIs may contain all kinds 
of information from all kinds of protocols or formats that use characters 
beyond ASCII. The URI syntax defined in RFC 2396 currently only allows a 
subset of ASCII, about 60 characters. It also defines a way to encode 
arbitrary bytes into URI characters: a % followed by two hexadecimal digits 
(%HH-escaping). However, for historical reasons, it does not define how 
arbitrary characters are encoded into bytes before using %HH-escaping.

"Among various solutions discussed a few years ago, the use of UTF-8 as 
the preferred character encoding for URIs was judged best. This is in line 
with the IRI-to-URI conversion, which uses encoding as UTF-8 and then 
escaping with %hh:"

As for Claus' HTML editing software dilemma, I retract my comment about the
SGML entities enumerated in 7.3.3 of draft 11. 

My concrete suggestion: replace the current language

Other characters that would not be valid in the HTML document or that cannot 
be represented in the document's character encoding MUST be escaped using 
the percent-encoding (%xx) mechanism described in [RFC3986].

with this:

Any character in the href attributes MAY be represented as UTF-8 data escaped 
using the percent-encoding (%xx) mechanism described in [RFC3986].
Characters with Unicode values greater than U+007E MUST be represented as
UTF-8 data escaped using the percent-encoding (%xx) mechanism described in
[RFC3986]. For instance, the character "ä" (umlaut a, Unicode U+00E4) MUST be
represented as a six-character string like "%C3%A4" as suggested by RFC 2718.

Re: HTML discovery: SGML entities and charsets

2007-05-21 Thread Josh Hoyt
Claus,

On 5/20/07, Claus Färber <[EMAIL PROTECTED]> wrote:
> Peter Watkins schrieb:
> > 7.3.3 in draft 11 says
> >
> > The "openid2.provider" and "openid2.local_id" URLs MUST NOT include 
> > entities other than "&", "<", ">", and """. Other characters 
> > that would not be valid in the HTML document or that cannot be represented 
> > in the document's character encoding MUST be escaped using the 
> > percent-encoding (%xx) mechanism described in [RFC3986] (Berners-Lee, T., 
> > .Uniform Resource Identifiers (URI): Generic Syntax,. .).
>
> Please note that the draft is completely broken here:

Can you suggest improvements and examples or test cases of how you
think it should work?

There has been a little discussion in the past about the restriction
on allowed character entity references. I don't think there has been
any about numeric character references, except in lumping them in with
character entity references.

These restrictions live on from the OpenID 1 specification, and were
preserved primarily to ease backwards compatibility (IIRC).

Josh


Re: HTML discovery: SGML entities and charsets

2007-05-20 Thread Claus Färber
Peter Watkins schrieb:
> 7.3.3 in draft 11 says
> 
> The "openid2.provider" and "openid2.local_id" URLs MUST NOT include entities 
> other than "&", "<", ">", and """. Other characters that would 
> not be valid in the HTML document or that cannot be represented in the 
> document's character encoding MUST be escaped using the percent-encoding 
> (%xx) mechanism described in [RFC3986] (Berners-Lee, T., .Uniform Resource 
> Identifiers (URI): Generic Syntax,. .).

Please note that the draft is completely broken here:

It's unclear: The first sentence talks about "entities", which can only 
refer to "character entity references" (HTML 4.01, 5.3.2). The second 
sentence mandates RFC 3986 encoding, which is plain wrong because it 
changes the URI. It does not talk about "numeric character references" 
at all (which are _not_ entities, see HTML 4.01, 5.3.1), even though a 
numeric reference ("&#39;") is the only correct way to encode a "'" in a 
single-quoted attribute value.

It's incompatible: An HTML editor, tool or filter may assume that 
changing any character to a character reference is allowed, so it may change 
"http://example.org?login=user@example.net" to 
"http://example.org?login=user&#64;example.net" without changing the 
meaning. The spec breaks this assumption.

It's dangerous: It's there to allow RP implementations to use a quick and 
dirty regexp-based parser instead of a true HTML parser, which (a) may 
break with completely valid HTML documents (bad user experience) and (b) 
may circumvent security measures taken by the site owners.

> 1) Why are the characters &, <, >, and " allowed to be represented with those
> SGML entities? Why not require them to be encoded per RFC 3986 as %26, %3C,
> %3E, and %22? 

The point of RFC 3986 encoding is that URL special chars lose their 
special meaning _within_ _the_ _URL_:

http://example.org/?foo=1&bar=2 contains two parameters: "foo" with the 
value "1" and "bar" with the value "2".
http://example.org/?foo=1%26bar=2 contains a _single_ parameter, "foo", 
with the value "1&bar=2".

The point of HTML encoding is that HTML special chars lose their special 
meaning _within_ _HTML_:

<a href="http://example.org/?x=1&copy=2"> is a link to the IRI
http://example.org/?x=1©=2, which is equivalent to the ASCII URI 
http://example.org/?x=1%C2%A9%3D2.

<a href="http://example.org/?x=1&amp;copy=2"> is a link to the URI
http://example.org/?x=1&copy=2

However, "<" and ">" are not legal within URIs and IRIs anyways. Other 
characters with named entities are outside the ASCII range and thus 
illegal in URIs but not IRIs.
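The &copy trap above can be demonstrated with Python's html.unescape, which implements the HTML5 reference table (including legacy names recognised even without a trailing semicolon); the URLs are illustrative:

```python
from html import unescape

# A bare "&copy" (no semicolon) is still recognised as the copyright
# entity, so the two attribute values decode to different targets:
iri = unescape('http://example.org/?x=1&copy=2')      # ...x=1©=2
uri = unescape('http://example.org/?x=1&amp;copy=2')  # ...x=1&copy=2
```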

> 2) Also, should 7.3.3 specify that, as with the key/value data pairs, these
> values be encoded in UTF-8? Requiring UTF-8 would free RP code from having
> to understand different HTML character sets, and would allow users to encode
> their HTML delivery pages in the charset of their choosing.

No, the whole HTML document must use the same character set.

However, unless you're using IRIs, you can usually get away with 
treating the document as ASCII; you'll have some characters with the 8th 
bit set but you can simply ignore them if you just want to extract URIs.

Problematic charsets include ISO 2022 (common), Shift-JIS (very common; 
only "~" is a problem wrt URIs, as it can't be encoded at all), UTF-16 
(rare), UTF-32 (very rare), EBCDIC-based charsets (very rare) and 
national ISO 646 variants.

Claus



HTML discovery: SGML entities and charsets

2007-05-18 Thread Peter Watkins
7.3.3 in draft 11 says

The "openid2.provider" and "openid2.local_id" URLs MUST NOT include entities 
other than "&", "<", ">", and """. Other characters that would 
not be valid in the HTML document or that cannot be represented in the 
document's character encoding MUST be escaped using the percent-encoding (%xx) 
mechanism described in [RFC3986] (Berners-Lee, T., .Uniform Resource 
Identifiers (URI): Generic Syntax,. .).

Questions:

1) Why are the characters &, <, >, and " allowed to be represented with those
SGML entities? Why not require them to be encoded per RFC 3986 as %26, %3C,
%3E, and %22? 

2) Also, should 7.3.3 specify that, as with the key/value data pairs, these
values be encoded in UTF-8? Requiring UTF-8 would free RP code from having
to understand different HTML character sets, and would allow users to encode
their HTML delivery pages in the charset of their choosing. As it stands, 
it appears that the HTML document containing the LINK tags could be encoded 
in any charset, with the RP responsible for decoding. With the existence 
of "internationalized" domain names, it's quite possible that the provider 
and local_id values will contain non-ASCII characters. Specifying UTF-8 
encoding for HTML discovery will allow leaner, more reliable RP code.

-Peter
