[whatwg] URL parsing and same-document references [was: Re: Citing multiple blockquote elements in HTML5]
On Wed, 3 Dec 2008, Calogero Alex Baldacchino wrote:

My concern is that a character-by-character comparison between an id value and a fragment identifier may fail in several ways. What about href="#foo bar " and id="foo bar "? The current rules would strip the trailing space only for the href, so the matching would fail (but we might survive broken links). Escaping both and then comparing would succeed, as would first escaping and then unescaping the href value before comparing (should it be pointed out, somewhere, that a fragment identifier must be unescaped before comparing to an id or a name? Is it, and I've missed it? Having space characters in the unreserved production means they don't need to be escaped, but does it also mean they must be decoded from their pct-encoded form, after parsing and for resolving?).

The behavior specced now may change, but as it stands now unescaping is defined for fragment-identifier-to-id= matching. In general, though, the behaviour is constrained by what IE does and more to the point by what is needed by content that depends on what IE does. (You sent another couple of e-mails on the topic; I understand -- mostly -- the points you make therein, and would like to refer you to the recent thread on the topic: http://lists.w3.org/Archives/Public/public-html/2009Feb/thread.html#msg407 ...where the same issues were discussed with more concrete reference to actual implementations and constraints placed on us by legacy content.)

What terminology would you prefer rather than subtree? (We can't say document, since we are also trying to define conformance rules for disconnected subtrees handled from scripts.)
Uhm, it may depend on what kinds of manipulations you have in mind, whether the disconnected subtree must be anyway a whole document to fulfil the uniqueness rule, and perhaps also on what the subtree concept might be turned into by future DOM Core versions, so maybe just a clarification on what a subtree is with respect to both the document (as a tree) and the scripts handling possibilities might be enough, instead of searching a new terminology, just to 'scope' the id visibility. I mean, if the ID matching is relevant for scripts accessing the matching element through the getElementById() method, actually a document tree is always overlapping the concept of subtree, and a disconnected subtree must be a document without a browsing context; otherwise, if other dom manipulations are involved the concept of subtree may change, for instance a script might implement its own scanning routine, treating an id attribute as any other attribute and leading to the concept that any non-leaf node may be the root of a subtree (that is identifying a subtree with any possible document fragment); furthermore, a possible future version of DOM Core interfaces might move the getElementById method to the Node interface, leading to the same result. Thus, a generic definition of 'subtree' (or no definition, or a definition relying upon a specific DOM feature or on script handling) might result in a variable concept with a variable scope for the ID uniqueness, but might make sense in a working draft until at least a first definition of the Web DOM Core specification, or waiting for any reason arising to restrict or enlarge the concept; otherwise, if that's been stated with a large consensus that a subtree is always a document tree, the term might be changed into the expression a document, with or without a browsing context, or (equivalently) be defined as a document subtree having a node of type document as its root (to cover the case of dynamically created documents). 
Otherwise, if a subtree can be either a whole document, or a document subtree detached from its owner document (i.e. a node removed from a document with its descendants, or a tree of nodes whose ownerDocument property is not defined or null), it might be defined just as such, leaving the term 'subtree' wherever it is now (but would such a manipulation be consistent with the - authoring - uniqueness rule when the subtree is inserted into an actual document?).

My brain got lost partway through reading the above, so I apologise if I missed a key point you were making. Anyway, the spec now has the term home subtree, which is defined in more detail than subtree was before. I hope this helps.

On Sat, 13 Dec 2008, Nils Dagsson Moskopp wrote:

On Friday, 12.12.2008, 20:36 +0100, Calogero Alex Baldacchino wrote:

The above (except for the 'double check' I was suggesting) is about the way Firefox (2.x and 3.0.4) behaves (both href=#foo%20bar and, in a different page, href=./example.html#foo%20bar match id=foo bar), while IE7 and Opera 9.x perform an exact comparison and show, in the address bar, a URL with any blank spaces left in, thus applying the relaxation allowed by the URL parsing rules but not conforming to RFC 3986 as a complete URI string.

Whenever I
Re: [whatwg] URL parsing and same-document references [was: Re: Citing multiple blockquote elements in HTML5]
Nils Dagsson Moskopp wrote:

On Friday, 12.12.2008, 20:36 +0100, Calogero Alex Baldacchino wrote:

The above (except for the 'double check' I was suggesting) is about the way Firefox (2.x and 3.0.4) behaves (both href=#foo%20bar and, in a different page, href=./example.html#foo%20bar match id=foo bar), while IE7 and Opera 9.x perform an exact comparison and show, in the address bar, a URL with any blank spaces left in, thus applying the relaxation allowed by the URL parsing rules but not conforming to RFC 3986 as a complete URI string.

Whenever I copy and paste a URI from the address bar into any other program, I am severely annoyed by this, especially when spaces (delimiters!) are part of the fake URI. A chat or office program, for example, is unable to highlight the fake URI anymore (how could it?), and pasting it into source code can create all kinds of validation errors. And whenever I get a bastardized URI via chat or mail, only a part of it is clickable. Can someone from the web browser faction please state if there is any data to support breaking RFC compatibility? Because as I see it, it's something that makes it appear nicer, but breaks whenever URIs are to be transferred / communicated.

Actually I'm not from any faction, to be honest. I think a rationale for that may be that people write strange things, both in address bars and in HTML code, so relaxing the rules when parsing a URL is meaningful; but I think that when resolving and recomposing a whole URI the strictest rules should be applied.

Getting to the problem mentioned here, the robustness principle says that id=foo bar should be accepted, but is nevertheless invalid - because a fragment with a space can never be part of a URI.

Indeed, that's not part of a URI, but a dereferenced component: when splitting a URI into its components, there is no need to keep %-encoded characters (RFC 3986 says separated components can be decoded; thus, AIUI, both href=#foo bar and id=foo bar respect the conformance rules, but when resolving #foo bar into a complete, absolute URI, the result should always look like http://example.org/something.html#foo%20bar to be conforming).

Regards, Alex
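As a rough illustration of the point above (relaxed on input, strict on output), here is a minimal Python sketch. It assumes the standard library's urllib.parse; the helper name resolve_for_output and the decision to leave existing %-triplets untouched are choices made for this example, not something taken from the HTML5 draft or from any browser.

```python
from urllib.parse import quote, urljoin, urlsplit, urlunsplit

# Characters that RFC 3986 allows unescaped in a fragment, besides the
# unreserved set that quote() never escapes. "%" is kept so that triplets
# which are already encoded are not encoded a second time (a simplification:
# a stray "%" that is not part of a triplet is left alone too).
FRAGMENT_SAFE = "%!$&'()*+,;=:@/?"

def resolve_for_output(base: str, reference: str) -> str:
    """Resolve a (possibly sloppy) reference, then %-encode the fragment."""
    absolute = urljoin(base, reference)          # lenient: keeps the raw space
    parts = urlsplit(absolute)
    fragment = quote(parts.fragment, safe=FRAGMENT_SAFE)
    return urlunsplit(parts._replace(fragment=fragment))

print(resolve_for_output("http://example.org/something.html", "#foo bar"))
```

With this approach the input #foo bar and the input #foo%20bar both come out as the same absolute string, which is the property being asked for when the URL is shown or copied.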
Re: [whatwg] URL parsing and same-document references [was: Re: Citing multiple blockquote elements in HTML5]
On Saturday, 13.12.2008, 19:09 +0100, Calogero Alex Baldacchino wrote:

Actually I'm not from any faction, to be honest. I think a rationale for that may be that people write strange things, both in address bars and in HTML code, so relaxing the rules when parsing a URL is meaningful; but I think that when resolving and recomposing a whole URI the strictest rules should be applied.

Accepting weird input is not the problem here; outputting it is. Try writing a valid URI into the address bar and then getting an invalid one displayed.

Greetings
-- Nils Dagsson Moskopp http://dieweltistgarnichtso.net
Re: [whatwg] URL parsing and same-document references [was: Re: Citing multiple blockquote elements in HTML5]
Nils Dagsson Moskopp wrote:

On Saturday, 13.12.2008, 19:09 +0100, Calogero Alex Baldacchino wrote:

Actually I'm not from any faction, to be honest. I think a rationale for that may be that people write strange things, both in address bars and in HTML code, so relaxing the rules when parsing a URL is meaningful; but I think that when resolving and recomposing a whole URI the strictest rules should be applied.

Accepting weird input is not the problem here; outputting it is. Try writing a valid URI into the address bar and then getting an invalid one displayed. Greetings

Could you give an example, please? I wasn't able to reproduce this in IE7 or Opera 9.27 (e.g., http://real.addressofasite.com/index.html#foo%20bar wasn't changed into http://real.addressofasite.com/index.html#foo bar). Anyway, I guess you got the point. Relaxed parsing rules are for input URLs, but after parsing, a normalization and/or the resolution algorithm should be applied, and the displayed URL, being absolute and complete, should conform to RFC 3986.

The current resolution algorithm (section 2.5.3 of the HTML5 spec) does not mention fragment identifiers explicitly. Although its 10th step says 'Apply any relevant conformance criteria of RFC 3986 and RFC 3987, returning an error and aborting these steps if appropriate.', step 9 says 'Apply the algorithm described in RFC 3986 section 5.2 Relative Resolution, using url as the potentially relative URI reference (R), and base as the base URI (Base)'. AIUI, the algorithm described in section 5.2 of RFC 3986 might be applied to each component of a URI without building a complete URI (instead leaving each part separate and held as a property of an object - a component recomposition algorithm is defined in section 5.3 of RFC 3986, but that's not a 'must'). When a single component of a URI is to be handled, RFC 3986 does not require %-encoding as a 'must'; hence the freedom of interpretation and the different behaviors in different UAs, leading to inconsistent results when copying a URL from one UA and pasting it into another. I think a uniform behaviour should be defined as standard (and implemented!) instead. (The concern you raised about copy and paste perhaps leads to a further issue regarding how line breaks should be handled by the parsing rules - e.g. stripped like leading and trailing characters.)

Regards, Alex
Re: [whatwg] URL parsing and same-document references [was: Re: Citing multiple blockquote elements in HTML5]
Calogero Alex Baldacchino wrote:

Maybe the above needs a further clarification. Let me start from the URL parsing (and resolving) rules: after the URL is validated, it's divided into its components, but nothing is stated about normalization and/or %-encoded characters. I think that applying some normalization may be useful to parse equivalent URLs in a consistent manner, helpful when dealing with the interfaces for URL manipulation described in section 2.5.5 and, last but not least, an improvement in matching relative references (especially same-document references). A minimum requirement, for the sake of standardization, may consist of decoding any %-encoded characters in the fragment production that are part of the unreserved production as defined in RFC 3986, with the changes defined in the HTML 5 specification for URL parsing, and restricted to the Unicode ranges representing valid characters for an attribute value (those which are prohibited neither as 'text' nor as 'character references'). This way, a character-for-character comparison between a fragment identifier and an id attribute value, which would have been equivalent but not matching without the normalization, should succeed most of the time because, as a consequence of the changes applied by the current HTML 5 specification to the unreserved production, such characters might or might not be %-encoded in a valid URL, while an id value is likely to contain them non-encoded. After the above fragment normalization, a character-for-character comparison would fail if the id value contained any %-encoded triplet representing a decoded character, such as foo%20bar. Anyway, that may be a weird thing to deal with, since it can be the %-encoded form of foo bar, but also the decoded form of foo%2520bar.
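To make the ambiguity in the last sentence concrete, here is a small Python sketch using urllib.parse from the standard library. The function name readings is made up for this example; it only shows that a string containing a %-triplet can be read in two directions, and says nothing about which reading a UA should pick.

```python
from urllib.parse import quote, unquote

def readings(id_value: str) -> dict:
    """Return the two plausible origins of a string that contains %-triplets."""
    return {
        # Reading 1: the string is an encoded form; this is what it decodes to.
        "decodes_to": unquote(id_value),
        # Reading 2: the string is already decoded; this is the encoded form
        # that would decode to it (the literal "%" becomes "%25").
        "encoded_from": quote(id_value, safe=""),
    }

print(readings("foo%20bar"))
# {'decodes_to': 'foo bar', 'encoded_from': 'foo%2520bar'}
```

Looking only at the id value, nothing tells the two readings apart, which is why a comparison that starts from a single component cannot be fully reliable.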
In other words, if we apply the same normalization to two complete URLs and then compare them, the result is quite reliable, but if we start from a component (such as a fragment identifier stored in an id attribute value) it's not easy to tell whether any normalization has been applied, and which one, so there is always a chance of false positives or false negatives. According to RFC 3986, section 4.4 (Same-Document Reference), the correct interpretation of a URI as a same-document reference cannot be held as guaranteed, so a mismatch between, for instance, the decoded fragment identifier foo bar and the id attribute value foo%20bar can be considered reasonable, given (as I think) a wide majority of good matches. Anyway, a kind of double check might be considered, such as:

- comparing the %-unescaped fragment identifier with the ID of each element in the DOM;
- upon failure, applying a %-unescape algorithm to the ID, then comparing again with the fragment identifier and, if matching, marking the element as a 'possible choice';
- upon a perfect (exact) match, without unescaping the evaluated element ID, choosing that element as the referenced document part (currently defined as the indicated part of the document in the spec) and stopping;
- without any perfect match in the whole document, choosing the first 'possible choice', if any;
- without any match at all, the search for the referenced document part fails.

Compared with a single check for an exact match, the overall computation time should still increase only linearly, so this should not be a performance issue.

Best regards, Alex.

The above (except for the 'double check' I was suggesting) is about the way Firefox (2.x and 3.0.4) behaves (both href=#foo%20bar and, in a different page, href=./example.html#foo%20bar match id=foo bar), while IE7 and Opera 9.x perform an exact comparison and show, in the address bar, a URL with any blank spaces left in, thus applying the relaxation allowed by the URL parsing rules but not conforming to RFC 3986 as a complete URI string. It seems different browsers implement (more or less) different normalization/resolution algorithms, leading to different matches, so specifying a uniform behaviour (whichever one) might be reasonable and useful. The current resolving algorithm, while explicitly asking for %-encoding in the path component and for conformance with RFC 3986 in general, doesn't talk about fragment identifiers; the referenced algorithm for relative resolution (section 5.2 of RFC 3986), AIUI, might not require the creation of a complete URI string, but could instead be accomplished by returning an object holding a separate string for each URI part, thus not necessarily requiring %-encoding and potentially leaving UAs a certain degree of freedom. Furthermore, about URL decomposition attributes it is said, 'On setting, the new value must first be mutated as described by the setter preprocessor column, then mutated by %-escaping any characters in the new value that are not valid in the relevant component as given by the component column.'; this seems to refer to the stricter RFC3986
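The 'double check' listed in this message could be sketched roughly as follows in Python. This is only one reading of the steps: find_indicated_id is a hypothetical name, the document is reduced to a plain list of id values in tree order, and urllib.parse.unquote stands in for the %-unescape algorithm.

```python
from typing import Optional, Sequence
from urllib.parse import unquote

def find_indicated_id(fragment: str, ids: Sequence[str]) -> Optional[str]:
    """Return the id chosen by the double check, or None if nothing matches."""
    target = unquote(fragment)       # %-unescape the fragment identifier once
    possible = None                  # first 'possible choice' seen so far
    for element_id in ids:           # ids are assumed to be in document order
        if element_id == target:
            return element_id        # perfect match: choose it and stop
        if possible is None and unquote(element_id) == target:
            possible = element_id    # matches only after unescaping the id
    return possible                  # first possible choice, or None

print(find_indicated_id("foo%20bar", ["intro", "foo%20bar", "foo bar"]))
# foo bar
```

Each id is examined once and unescaped at most once, so the cost stays linear in the number of elements, in line with the remark about computation time.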
Re: [whatwg] URL parsing and same-document references [was: Re: Citing multiple blockquote elements in HTML5]
On Friday, 12.12.2008, 20:36 +0100, Calogero Alex Baldacchino wrote:

The above (except for the 'double check' I was suggesting) is about the way Firefox (2.x and 3.0.4) behaves (both href=#foo%20bar and, in a different page, href=./example.html#foo%20bar match id=foo bar), while IE7 and Opera 9.x perform an exact comparison and show, in the address bar, a URL with any blank spaces left in, thus applying the relaxation allowed by the URL parsing rules but not conforming to RFC 3986 as a complete URI string.

Whenever I copy and paste a URI from the address bar into any other program, I am severely annoyed by this, especially when spaces (delimiters!) are part of the fake URI. A chat or office program, for example, is unable to highlight the fake URI anymore (how could it?), and pasting it into source code can create all kinds of validation errors. And whenever I get a bastardized URI via chat or mail, only a part of it is clickable. Can someone from the web browser faction please state if there is any data to support breaking RFC compatibility? Because as I see it, it's something that makes it appear nicer, but breaks whenever URIs are to be transferred / communicated.

Getting to the problem mentioned here, the robustness principle says that id=foo bar should be accepted, but is nevertheless invalid - because a fragment with a space can never be part of a URI. So IMHO, any program should strive to accept broken URIs if they are unambiguous (which they are here, because the address bar can hold only one URI at a time), but never output them.

Greetings
-- Nils Dagsson Moskopp http://dieweltistgarnichtso.net
[whatwg] URL parsing and same-document references [was: Re: Citing multiple blockquote elements in HTML5]
Calogero Alex Baldacchino wrote:

Maybe the first is wrong, and I'm still unsure of the second. My concern is that a character-by-character comparison between an id value and a fragment identifier may fail in several ways. What about href="#foo bar " and id="foo bar "? The current rules would strip the trailing space only for the href, so the matching would fail (but we might survive broken links). Escaping both and then comparing would succeed, as would first escaping and then unescaping the href value before comparing (should it be pointed out, somewhere, that a fragment identifier must be unescaped before comparing to an id or a name? Is it, and I've missed it? Having space characters in the unreserved production means they don't need to be escaped, but does it also mean they must be decoded from their pct-encoded form, after parsing and for resolving?). Likewise, stripping the trailing spaces in both cases would succeed, but would fail when comparing id="foo bar " with href="#foo bar%20" (which is a valid URL according to the current parsing rules), even with escaping rules (in this case the id value's trailing space must stay there). And what about id=foo%20bar in http://foo.example.org/foo.html and href=#foo bar on the same page, or on a page having the same base URL, or a base element with href=http://foo.example.org/foo.html? My point is that, since comparisons for matching purposes happen after URL parsing and resolution, and the id value is not involved in those steps, character-by-character comparisons may fail without a prior normalization of both the fragment identifier and the id value (or one of them). However, if the above is already solved by the parsing and resolving rules and I've misunderstood the spec, I withdraw it all and apologize. Or, perhaps, must a valid URL with a valid fragment, which is equivalent to but not exactly matching an id value, be considered a broken link?

Maybe the above needs a further clarification.
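The trailing-space cases raised above can be tried out with a short Python sketch. It assumes that only the href attribute has its surrounding whitespace stripped while the id is left as written, which is how the message describes the rules; HTML_SPACE, fragment_of, exact_match and unescaped_match are names invented for this example.

```python
from urllib.parse import unquote

HTML_SPACE = " \t\n\f\r"   # assumed set of whitespace stripped from the href

def fragment_of(href: str) -> str:
    """Strip surrounding whitespace from the href, then take what follows '#'."""
    return href.strip(HTML_SPACE).partition("#")[2]

def exact_match(href: str, id_value: str) -> bool:
    # Character-by-character comparison, no unescaping.
    return fragment_of(href) == id_value

def unescaped_match(href: str, id_value: str) -> bool:
    # Unescape the fragment once before comparing; the id is left untouched.
    return unquote(fragment_of(href)) == id_value

print(exact_match("#foo bar ", "foo bar "))        # False: href lost its space
print(unescaped_match("#foo bar%20", "foo bar "))  # True: %20 survives stripping
```

The second case only works because the id keeps its trailing space, which is the reason given above for not stripping the id value as well.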
Let me start from the URL parsing (and resolving) rules: after the URL is validated, it's divided into its components, but nothing is stated about normalization and/or %-encoded characters. I think that applying some normalization may be useful to parse equivalent URLs in a consistent manner, helpful when dealing with the interfaces for URL manipulation described in section 2.5.5 and, last but not least, an improvement in matching relative references (especially same-document references). A minimum requirement, for the sake of standardization, may consist of decoding any %-encoded characters in the fragment production that are part of the unreserved production as defined in RFC 3986, with the changes defined in the HTML 5 specification for URL parsing, and restricted to the Unicode ranges representing valid characters for an attribute value (those which are prohibited neither as 'text' nor as 'character references'). This way, a character-for-character comparison between a fragment identifier and an id attribute value, which would have been equivalent but not matching without the normalization, should succeed most of the time because, as a consequence of the changes applied by the current HTML 5 specification to the unreserved production, such characters might or might not be %-encoded in a valid URL, while an id value is likely to contain them non-encoded. After the above fragment normalization, a character-for-character comparison would fail if the id value contained any %-encoded triplet representing a decoded character, such as foo%20bar. Anyway, that may be a weird thing to deal with, since it can be the %-encoded form of foo bar, but also the decoded form of foo%2520bar.

In other words, if we apply the same normalization to two complete URLs and then compare them, the result is quite reliable, but if we start from a component (such as a fragment identifier stored in an id attribute value) it's not easy to tell whether any normalization has been applied, and which one, so there is always a chance of false positives or false negatives. According to RFC 3986, section 4.4 (Same-Document Reference), the correct interpretation of a URI as a same-document reference cannot be held as guaranteed, so a mismatch between, for instance, the decoded fragment identifier foo bar and the id attribute value foo%20bar can be considered reasonable, given (as I think) a wide majority of good matches. Anyway, a kind of double check might be considered, such as:

- comparing the %-unescaped fragment identifier with the ID of each element in the DOM;
- upon failure, applying a %-unescape algorithm to the ID, then comparing again with the fragment identifier and, if matching, marking the element as a 'possible choice';
- upon a perfect (exact) match, without unescaping the evaluated element ID, choosing that element as the referenced document part (currently defined as the indicated part of the document in the spec) and stopping;
- without any