[whatwg] URL parsing and same-document references [was: Re: Citing multiple blockquote elements in HTML5]

2009-02-18 Thread Ian Hickson
On Wed, 3 Dec 2008, Calogero Alex Baldacchino wrote:
 
 My concern is that a character-by-character comparison between an id value 
 and a fragment identifier may fail in several ways. What about 
 href="#foo bar " and id="foo bar "? The current rules would strip the 
 trailing space only for the href, so the match would fail (but we might 
 survive broken links). Escaping both and then comparing would succeed, as 
 would first escaping and then unescaping the href value before comparing. 
 (Should it be pointed out somewhere that a fragment identifier must be 
 unescaped before it is compared to an id or a name? Is it already, and I've 
 missed it? Having space characters in the unreserved production means they 
 don't need to be escaped, but does it also mean they must be decoded 
 from their pct-encoded form after parsing and for resolving?)

The behavior specced now may change, but as it stands now unescaping is 
defined for fragment-identifier-to-id= matching.
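
As a rough illustration of that matching step, here is a minimal sketch. It is not the spec's algorithm: fragment_matches_id is a hypothetical helper, and Python's urllib.parse.unquote merely stands in for whatever unescaping the spec defines. The fragment is unescaped first, then compared with the id value as written.

```python
from urllib.parse import unquote


def fragment_matches_id(fragment: str, element_id: str) -> bool:
    """Unescape the fragment, then compare it with the id character by character.

    Hypothetical helper for illustration only; the id value is left untouched.
    """
    return unquote(fragment) == element_id


print(fragment_matches_id("foo%20bar", "foo bar"))    # escaped fragment, plain id
print(fragment_matches_id("foo bar", "foo bar"))      # unescaped fragment, plain id
print(fragment_matches_id("foo%20bar", "foo%20bar"))  # id that itself contains %20
```

Under these assumptions the first two comparisons succeed and the third fails, which is the case discussed elsewhere in this thread where the id value itself contains a %-encoded triplet.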

In general, though, the behaviour is constrained by what IE does and more 
to the point by what is needed by content that depends on what IE does.

(You sent another couple of e-mails on the topic; I understand -- mostly 
-- the points you make therein, and would like to refer you to the recent 
thread on the topic:

   http://lists.w3.org/Archives/Public/public-html/2009Feb/thread.html#msg407

...where the same issues were discussed with more concrete reference to 
actual implementations and constraints placed on us by legacy content.)


  What terminology would you prefer rather than subtree? (We can't say 
  document, since we are also trying to define conformance rules for 
  disconnected subtrees handled from scripts.)
 
 Uhm, it may depend on what kinds of manipulations you have in mind, whether
 the disconnected subtree must be anyway a whole document to fulfil the
 uniqueness rule, and perhaps also on what the subtree concept might be turned
 into by future DOM Core versions, so maybe just a clarification on what a
 subtree is with respect to both the document (as a tree) and the scripts
 handling possibilities might be enough, instead of searching a new
 terminology, just to 'scope' the id visibility. I mean, if the ID matching is
 relevant for scripts accessing the matching element through the
 getElementById() method, actually a document tree is always overlapping the
 concept of subtree, and a disconnected subtree must be a document without a
 browsing context; otherwise, if other dom manipulations are involved the
 concept of subtree may change, for instance a script might implement its own
 scanning routine, treating an id attribute as any other attribute and leading
 to the concept that any non-leaf node may be the root of a subtree (that is
 identifying a subtree with any possible document fragment); furthermore, a
 possible future version of DOM Core interfaces might move the getElementById
 method to the Node interface, leading to the same result. Thus, a generic
 definition of 'subtree' (or no definition, or a definition relying upon a
 specific DOM feature or on script handling) might result in a variable concept
 with a variable scope for the ID uniqueness, but might make sense in a working
 draft until at least a first definition of the Web DOM Core specification, or
 waiting for any reason arising to restrict or enlarge the concept; otherwise,
 if that's been stated with a large consensus that a subtree is always a
 document tree, the term might be changed into the expression a document, with
 or without a browsing context, or (equivalently) be defined as a document
 subtree having a node of type document as its root (to cover the case of
 dynamically created documents). Otherwise, if a subtree can be either a whole
 document, or a document subtree detached from its owner document (i.e. a node
 removed from a document with its descendants, or a tree of nodes whose
 ownerDocument property is not defined or null), it might be defined just as
 such, leaving the term 'subtree' wherever it is now (but would such a
 manipulation be consistent with the - authoring - uniqueness rule when the
 subtree is inserted into an actual document?).

My brain got lost partway through reading the above, so I apologise if I 
missed a key point you were making.

Anyway, the spec now has the term home subtree, which is defined in more 
detail than subtree was before. I hope this helps.


On Sat, 13 Dec 2008, Nils Dagsson Moskopp wrote:
 On Friday, 12.12.2008, at 20:36 +0100, Calogero Alex
 Baldacchino wrote:
 
  The above (apart from the 'double check' I was suggesting) is about the 
  way Firefox (2.x and 3.0.4) behaves (both href="#foo%20bar" and, in a 
  different page, href="./example.html#foo%20bar" match id="foo bar"), 
  while IE7 and Opera 9.x perform an exact comparison and show, in the 
  address bar, a URL with any blank spaces left unescaped, thus applying 
  the relaxation allowed by the URL parsing rules but not conforming to 
  RFC 3986 as a complete URI string.

 Whenever I 

Re: [whatwg] URL parsing and same-document references [was: Re: Citing multiple blockquote elements in HTML5]

2008-12-13 Thread Calogero Alex Baldacchino

Nils Dagsson Moskopp wrote:

On Friday, 12.12.2008, at 20:36 +0100, Calogero Alex
Baldacchino wrote:
  
The above (apart from the 'double check' I was suggesting) is about the 
way Firefox (2.x and 3.0.4) behaves (both href="#foo%20bar" and, in a 
different page, href="./example.html#foo%20bar" match id="foo bar"), 
while IE7 and Opera 9.x perform an exact comparison and show, in the 
address bar, a URL with any blank spaces left unescaped, thus applying 
the relaxation allowed by the URL parsing rules but not conforming to 
RFC 3986 as a complete URI string.


Whenever I copy and paste a URI from the address bar into any other program,
I am severely annoyed by this, especially when spaces (delimiters!) are
part of the fake URI. A chat or office program, for example, is unable
to highlight the fake URI anymore (how could it?); also, pasting it
into source code can create all kinds of validation errors. And whenever
I get a bastardized URI via chat or mail, only a part of it is
clickable.

Can someone from the web browser faction please state whether there is any
data to support breaking RFC compatibility? Because as I see it, it's
something that makes the URI appear nicer, but breaks whenever URIs are to
be transferred / communicated.
  


Actually I'm not from any faction, to be honest. I think a rationale for 
that may be people write strange things, both in address bars and in 
html code, thus relaxing rules when parsing an URL is meaningful; but I 
think when resolving and recomposing a whole URI the strictest rules 
should be applied.



Getting to the problem mentioned here, the robustness principle says
that id="foo bar" should be accepted but is nevertheless invalid, because
a fragment with a space can never be part of a URI.


Indeed, that's not part of a URI but a dereferenced component: when 
splitting a URI into its components, there is no need to keep %-encoded 
characters. RFC 3986 says separated components can be decoded; thus, 
AIUI, both href="#foo bar" and id="foo bar" comply with the conformance 
rules, but when resolving "#foo bar" into a complete, absolute URI, the 
result should always look like 
http://example.org/something.html#foo%20bar to be conforming.
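
To make the input/output distinction concrete, here is a minimal sketch. It is not the HTML5 resolution algorithm: resolve_fragment_reference is a hypothetical helper that handles only fragment-only references, and Python's urllib.parse.quote is assumed to be an acceptable stand-in for the %-encoding step. The relaxed reference is accepted as typed, and the recomposed absolute URI is always emitted with the space escaped.

```python
from urllib.parse import quote, urldefrag

# Characters RFC 3986 allows unescaped in a fragment besides the unreserved
# set: sub-delims, ":", "@", "/" and "?". "%" is added so that triplets
# which are already escaped are not escaped a second time (an assumption of
# this sketch; a stray "%" is therefore passed through unchanged).
FRAGMENT_SAFE = "!$&'()*+,;=:@/?%"


def resolve_fragment_reference(base: str, reference: str) -> str:
    """Resolve a fragment-only reference against base, %-encoding on output."""
    if not reference.startswith("#"):
        raise ValueError("this sketch only handles fragment-only references")
    encoded_fragment = quote(reference[1:], safe=FRAGMENT_SAFE)
    base_without_fragment = urldefrag(base).url  # drop any fragment on the base
    return f"{base_without_fragment}#{encoded_fragment}"


print(resolve_fragment_reference("http://example.org/something.html", "#foo bar"))
print(resolve_fragment_reference("http://example.org/something.html", "#foo%20bar"))
```

Both calls produce the same absolute URI with %20 in the fragment, so the relaxed and the strict spelling of the reference converge on one output form.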



Regards,
Alex




Re: [whatwg] URL parsing and same-document references [was: Re: Citing multiple blockquote elements in HTML5]

2008-12-13 Thread Nils Dagsson Moskopp
On Saturday, 13.12.2008, at 19:09 +0100, Calogero Alex
Baldacchino wrote:
 Actually I'm not from any faction, to be honest. I think a rationale for 
 that may be people write strange things, both in address bars and in 
 html code, thus relaxing rules when parsing an URL is meaningful; but I 
 think when resolving and recomposing a whole URI the strictest rules 
 should be applied.
Accepting weird input is not the problem here; outputting it is. Try typing
a valid URI into the address bar: an invalid one gets displayed.


Greetings
-- 
Nils Dagsson Moskopp
http://dieweltistgarnichtso.net



Re: [whatwg] URL parsing and same-document references [was: Re: Citing multiple blockquote elements in HTML5]

2008-12-13 Thread Calogero Alex Baldacchino

Nils Dagsson Moskopp wrote:

On Saturday, 13.12.2008, at 19:09 +0100, Calogero Alex
Baldacchino wrote:
  
Actually I'm not from any faction, to be honest. I think a rationale for 
that may be people write strange things, both in address bars and in 
html code, thus relaxing rules when parsing an URL is meaningful; but I 
think when resolving and recomposing a whole URI the strictest rules 
should be applied.


Accepting weird input is not the problem here; outputting it is. Try typing
a valid URI into the address bar: an invalid one gets displayed.


Greetings
  


Could you give an example, please? I wasn't able to reproduce this in 
IE7 or Opera 9.27 (e.g., 
http://real.addressofasite.com/index.html#foo%20bar wasn't changed 
into "http://real.addressofasite.com/index.html#foo bar").


Anyway, I guess you got the point. Relaxed parsing rules are for input 
URLs, but after parsing, a normalization and/or the resolution algorithm 
should be applied, and the displayed URL, being absolute and complete, 
should conform to RFC 3986. The current resolution algorithm (section 
2.5.3 of the HTML5 spec) does not mention fragment identifiers explicitly. 
Its 10th step says "Apply any relevant conformance criteria of RFC 3986 
and RFC 3987, returning an error and aborting these steps if 
appropriate", but step 9 says "Apply the algorithm described in RFC 3986 
section 5.2 Relative Resolution, using url as the potentially relative 
URI reference (R), and base as the base URI (Base)".

AIUI, the algorithm described in section 5.2 of RFC 3986 might be applied 
to each component of a URI without building a complete URI (instead 
leaving each part separate and held as a property of an object; a 
component recomposition algorithm is defined in section 5.3 of RFC 3986, 
but it is not a 'must'). When a single component of a URI is handled, 
RFC 3986 does not require %-encoding as a 'must', hence the freedom of 
interpretation and the different behaviours in different UAs, leading to 
inconsistent results when copying a URL from one UA and pasting it into 
another. I think a uniform behaviour should instead be defined as standard 
(and implemented!). The concern you raised about copy and paste perhaps 
raises a further issue regarding how line breaks should be handled by the 
parsing rules (e.g. stripped like leading and trailing characters).


Regards,
Alex




Re: [whatwg] URL parsing and same-document references [was: Re: Citing multiple blockquote elements in HTML5]

2008-12-12 Thread Calogero Alex Baldacchino

Calogero Alex Baldacchino wrote:
Maybe the above needs further clarification. Let me start from the URL 
parsing (and resolving) rules: after the URL is validated, it's 
divided into its components, but nothing is stated about normalization 
and/or %-encoded characters. I think that applying some 
normalization may be useful to parse equivalent URLs in a consistent 
manner, helpful when dealing with the interfaces for URL manipulation 
described in section 2.5.5, and, last but not least, an improvement 
in relative reference matching (especially same-document references). 
A minimum requirement, for standardization's sake, may consist of 
decoding any %-encoded characters in the fragment production that 
are part of the unreserved production (as defined in RFC 3986 with 
the changes defined in the HTML 5 specification for URL parsing), 
restricted to the Unicode ranges representing valid characters for an 
attribute value (those which are prohibited neither as 'text' nor 
as 'character references'). This way, a character-for-character 
comparison between a fragment identifier and an id attribute value 
that are equivalent, but would not have matched without the 
normalization, should succeed most of the time, because, as a 
consequence of the changes applied by the current HTML 5 specification 
to the unreserved production, such characters might or might not be 
%-encoded in a valid URL, while an id value is likely to contain them 
non-encoded.


After the above fragment normalization, a character-for-character 
comparison would fail if the id value contained any %-encoded triplet 
representing a decoded character, such as "foo%20bar". Anyway, such a 
value may be a weird thing to deal with, since it can be the %-encoded 
form of "foo bar" but also the decoded form of "foo%2520bar". In other 
words, if we apply the same normalization to two complete URLs and then 
compare them, the result is quite reliable, but if we start from a 
component (such as a fragment identifier stored in an id attribute 
value) it's not easy to tell whether any normalization has been 
applied, and which one, so there is always a chance of false positives 
or false negatives. According to RFC 3986, section 4.4 
(Same-Document Reference), the correct interpretation of a URI as a 
same-document reference cannot be taken as guaranteed, so the 
mismatch between, for instance, the decoded fragment identifier 
"foo bar" and the id attribute value "foo%20bar" can be acceptable 
when weighed against what is (I think) a wide majority of good matches. 
Anyway, a kind of double check might be considered, such as:


- comparing the %-unescaped fragment identifier with the ID of each 
element in the DOM;
- upon failure, applying a %-unescape algorithm to the ID, then 
comparing again with the fragment identifier and, if matching, marking 
the element as a 'possible choice';
- upon a perfect (exact) match, without unescaping the evaluated 
element ID, choosing such element as the referenced document part 
(actually defined as the indicated part of the document in the spec) 
and stopping;
- without any perfect match in the whole document, choosing the first 
'possible choice', if any;
- without any match at all, the search for the referenced document 
part fails.


With respect to a single check for an exact match, the overall 
computational time should increase linearly, thus not being a 
performance issue.
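
The double check above can be sketched as follows. This is only an illustration of the proposal under stated assumptions: find_indicated_part is a hypothetical name, the document is reduced to a list of id strings in tree order, and Python's urllib.parse.unquote stands in for the %-unescape algorithm.

```python
from typing import Iterable, Optional
from urllib.parse import unquote


def find_indicated_part(fragment: str, ids_in_tree_order: Iterable[str]) -> Optional[str]:
    """Return the id chosen by the proposed double check, or None if nothing matches."""
    target = unquote(fragment)          # %-unescaped fragment identifier
    possible_choice: Optional[str] = None

    for element_id in ids_in_tree_order:
        if element_id == target:
            # Perfect match on the id as written: choose it and stop.
            return element_id
        if possible_choice is None and unquote(element_id) == target:
            # Matches only after unescaping the id: remember the first one.
            possible_choice = element_id

    # No perfect match anywhere: fall back to the first possible choice, if any.
    return possible_choice


print(find_indicated_part("foo%20bar", ["foo%20bar", "foo bar"]))
print(find_indicated_part("foo%20bar", ["intro", "foo%20bar"]))
print(find_indicated_part("foo%20bar", ["intro", "outro"]))
```

The sketch makes a single pass over the ids and does a bounded amount of extra work per element, which is consistent with the linear cost estimate given above.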


Best regards, Alex.


The above (apart from the 'double check' I was suggesting) is about the 
way Firefox (2.x and 3.0.4) behaves (both href="#foo%20bar" and, in a 
different page, href="./example.html#foo%20bar" match id="foo bar"), 
while IE7 and Opera 9.x perform an exact comparison and show, in the 
address bar, a URL with any blank spaces left unescaped, thus applying 
the relaxation allowed by the URL parsing rules but not conforming to 
RFC 3986 as a complete URI string. It seems different browsers implement 
(more or less) different normalization/resolution algorithms, leading to 
different matches, so specifying a uniform behaviour 
(whichever one) might be reasonable and useful. The current resolving 
algorithm, while explicitly asking for %-encoding in a path component 
and for conformance with RFC 3986 in general, doesn't talk about 
fragment identifiers; the referenced algorithm for relative resolution 
(section 5.2 of RFC 3986), AIUI, might not require the creation of a 
complete URI string, but could instead be accomplished by returning an 
object holding a separate string for each URI part, thus not necessarily 
requiring %-encoding and potentially leaving UAs a certain degree 
of freedom. Furthermore, about the URL decomposition attributes it is 
said: 'On setting, the new value must first be mutated as described by the 
setter preprocessor column, then mutated by %-escaping any characters 
in the new value that are not valid in the relevant component as given 
by the component column.' This seems to refer to the stricter RFC 3986 

Re: [whatwg] URL parsing and same-document references [was: Re: Citing multiple blockquote elements in HTML5]

2008-12-12 Thread Nils Dagsson Moskopp
On Friday, 12.12.2008, at 20:36 +0100, Calogero Alex
Baldacchino wrote:
 The above (apart from the 'double check' I was suggesting) is about the 
 way Firefox (2.x and 3.0.4) behaves (both href="#foo%20bar" and, in a 
 different page, href="./example.html#foo%20bar" match id="foo bar"), 
 while IE7 and Opera 9.x perform an exact comparison and show, in the 
 address bar, a URL with any blank spaces left unescaped, thus applying 
 the relaxation allowed by the URL parsing rules but not conforming to 
 RFC 3986 as a complete URI string.
Whenever I copy and paste a URI from the address bar into any other program,
I am severely annoyed by this, especially when spaces (delimiters!) are
part of the fake URI. A chat or office program, for example, is unable
to highlight the fake URI anymore (how could it?); also, pasting it
into source code can create all kinds of validation errors. And whenever
I get a bastardized URI via chat or mail, only a part of it is
clickable.

Can someone from the web browser faction please state whether there is any
data to support breaking RFC compatibility? Because as I see it, it's
something that makes the URI appear nicer, but breaks whenever URIs are to
be transferred / communicated.

Getting to the problem mentioned here, the robustness principle says
that id="foo bar" should be accepted but is nevertheless invalid, because
a fragment with a space can never be part of a URI. So IMHO, any
program should strive to accept broken URIs if they are unambiguous
(which they are here, because the address bar can hold only one URI at a
time), but never output them.


Greetings
-- 
Nils Dagsson Moskopp
http://dieweltistgarnichtso.net



[whatwg] URL parsing and same-document references [was: Re: Citing multiple blockquote elements in HTML5]

2008-12-04 Thread Calogero Alex Baldacchino
Calogero Alex Baldacchino wrote:


Maybe the first is wrong, and I'm still unsure of the second. My 
concern is that a character-by-character comparison between an id value 
and a fragment identifier may fail in several ways. What about 
href="#foo bar " and id="foo bar "? The current rules would strip the 
trailing space only for the href, so the match would fail (but we might 
survive broken links). Escaping both and then comparing would succeed, as 
would first escaping and then unescaping the href value before comparing. 
(Should it be pointed out somewhere that a fragment identifier must be 
unescaped before it is compared to an id or a name? Is it already, and I've 
missed it? Having space characters in the unreserved production means they 
don't need to be escaped, but does it also mean they must be decoded 
from their pct-encoded form after parsing and for resolving?) Likewise, 
stripping the trailing spaces in both cases would succeed, but would 
fail when comparing id="foo bar " with href="#foo bar%20" (which is a 
valid URL according to the current parsing rules), even with escaping 
rules (in this case the id value's trailing space must stay there). And 
what about id="foo%20bar" in http://foo.example.org/foo.html and 
href="#foo bar" on the same page, or on a page having the same base 
URL, or a base element with href="http://foo.example.org/foo.html"? 
My point is that, since comparisons for matching purposes happen after 
URL parsing and resolution, and the id value is not involved in those 
steps, character-by-character comparisons may fail without a prior 
normalization of both the fragment identifier and the id value (or one 
of them). However, if the above is already solved by the parsing and 
resolving rules and I've misunderstood the spec, I withdraw it all and 
apologize. Or, perhaps, must a valid URL with a valid fragment that 
is equivalent to, but does not exactly match, an id value be considered 
a broken link?


Maybe the above needs further clarification. Let me start from the URL 
parsing (and resolving) rules: after the URL is validated, it's divided 
into its components, but nothing is stated about normalization and/or 
%-encoded characters. I think that applying some normalization may 
be useful to parse equivalent URLs in a consistent manner, helpful when 
dealing with the interfaces for URL manipulation described in 
section 2.5.5, and, last but not least, an improvement in relative 
reference matching (especially same-document references). A minimum 
requirement, for standardization's sake, may consist of decoding any 
%-encoded characters in the fragment production that are part of the 
unreserved production (as defined in RFC 3986 with the changes defined 
in the HTML 5 specification for URL parsing), restricted to the Unicode 
ranges representing valid characters for an attribute value (those which 
are prohibited neither as 'text' nor as 'character references'). 
This way, a character-for-character comparison between a fragment 
identifier and an id attribute value that are equivalent, but would 
not have matched without the normalization, should succeed most of 
the time, because, as a consequence of the changes applied by the current 
HTML 5 specification to the unreserved production, such characters 
might or might not be %-encoded in a valid URL, while an id value is 
likely to contain them non-encoded.


After the above fragment normalization, a character-for-character 
comparison would fail if the id value contained any %-encoded triplet 
representing a decoded character, such as "foo%20bar". Anyway, such a 
value may be a weird thing to deal with, since it can be the %-encoded 
form of "foo bar" but also the decoded form of "foo%2520bar". In other 
words, if we apply the same normalization to two complete URLs and then 
compare them, the result is quite reliable, but if we start from a 
component (such as a fragment identifier stored in an id attribute value) 
it's not easy to tell whether any normalization has been applied, and 
which one, so there is always a chance of false positives or false 
negatives. According to RFC 3986, section 4.4 (Same-Document Reference), 
the correct interpretation of a URI as a same-document reference cannot 
be taken as guaranteed, so the mismatch between, for instance, the 
decoded fragment identifier "foo bar" and the id attribute value 
"foo%20bar" can be acceptable when weighed against what is (I think) 
a wide majority of good matches. Anyway, a kind of double check might be 
considered, such as:


- comparing the %-unescaped fragment identifier with the ID of each 
element in the DOM;
- upon failure, applying a %-unescape algorithm to the ID, then 
comparing again with the fragment identifier and, if matching, marking 
the element as a 'possible choice';
- upon a perfect (exact) match, without unescaping the evaluated element 
ID, choosing such element as the referenced document part (actually 
defined as the indicated part of the document in the spec) and stopping;
- without any