Re: URI Comparisons: RFC 2616 vs. RDF
On Tue, 18 Jan 2011 21:43:08 -0600 Peter DeVries pete.devr...@gmail.com wrote:

> I have URIs where case is important only at the terminal identifier. (HTML URIs in this example)
>
> http://lod.taxonconcept.org/ses/v6n7p.html should be different than http://lod.taxonconcept.org/ses/v6N7p.html
>
> Am I correct in thinking that this is OK?

Yes, HTTP URIs are case-sensitive apart from the scheme (http), the host (lod.taxonconcept.org) and the hexadecimal digits of percent-escaped characters (e.g. %7e vs %7E). Any URI canonicalisation tool that treats the above two URIs as the same is plain broken.

-- 
Toby A Inkster mailto:m...@tobyinkster.co.uk http://tobyinkster.co.uk
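The rule described in this message can be sketched in Python. This is a minimal illustration and not a complete RFC 3986 normalizer: the function name `case_normalize` is hypothetical, and lowercasing the whole authority is a simplification that would also lowercase any userinfo present.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Matches one percent-encoding triplet such as %7e or %7E.
_PCT_TRIPLET = re.compile(r"%[0-9A-Fa-f]{2}")


def _upper_triplets(text):
    """Uppercase only the hex digits inside percent-encoding triplets."""
    return _PCT_TRIPLET.sub(lambda m: m.group(0).upper(), text)


def case_normalize(uri):
    """Hypothetical helper: lowercase scheme and authority, uppercase %xx.

    Assumption: the authority carries no case-sensitive userinfo.
    Path, query and fragment are otherwise left untouched.
    """
    parts = urlsplit(uri)
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        _upper_triplets(parts.path),
        _upper_triplets(parts.query),
        _upper_triplets(parts.fragment),
    ))


a = case_normalize("HTTP://LOD.TaxonConcept.org/ses/v6n7p.html")
b = case_normalize("http://lod.taxonconcept.org/ses/v6N7p.html")
print(a)
print(b)
print(a == b)  # the paths differ in case, so the URIs stay different
```

Only the scheme, the host and the escape digits are folded; the two taxonconcept.org URIs from the question remain distinct after normalization.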
Re: URI Comparisons: RFC 2616 vs. RDF
On 1/22/11 8:27 AM, Toby Inkster wrote:

> On Tue, 18 Jan 2011 21:43:08 -0600 Peter DeVries pete.devr...@gmail.com wrote:
>
>> I have URIs where case is important only at the terminal identifier. (HTML URIs in this example)
>>
>> http://lod.taxonconcept.org/ses/v6n7p.html should be different than http://lod.taxonconcept.org/ses/v6N7p.html
>>
>> Am I correct in thinking that this is OK?
>
> Yes, HTTP URIs are case-sensitive apart from the scheme (http), host (lod.taxonconcept.org) and percent-escaped characters (e.g. %7e vs %7E). Any URI canonicalisation tool that treats the above two URIs as the same is plain broken.

Amen!

A URI is an Identifier. The fact that it can be used to identify a Data Source, i.e., an Address via the HTTP scheme that provides actual access to Data, doesn't negate the fact that it's fundamentally an Identifier.

The fact that the Web has manifested back to front (URL usage before URI grokking) doesn't mean everything has to follow this warped pattern. The Web is part of a technology continuum. Computing did exist before the WWW became ubiquitous.

-- 
Regards, Kingsley Idehen
President & CEO, OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen
Re: URI Comparisons: RFC 2616 vs. RDF
Harry Halpin wrote:

> On Thu, Jan 20, 2011 at 11:15 AM, Nathan nat...@webr3.org wrote:
>
>> Out of interest, where is that process defined? I was looking for it the other day - for instance in the quoted specification we have the example:
>>
>> <edi:price xmlns:edi='http://ecommerce.example.org/schema' units='Euro'>32.18</edi:price>
>>
>> Where's the bit of the XML specification which says you join them up by concatenating 'http://ecommerce.example.org/schema' with #(?assumed?) and 'Euro' to get 'http://ecommerce.example.org/schema#Euro'?
>
> Actually you don't. A namespace is just that - a tuple (namespace, localname) in XML. That's why namespaces in XML are for all intents and purposes broken and why, to a large extent, Web browser developers in HTML stopped using them and hate implementing them in the DOM, and so refuse to have them in HTML5. And that's one reason RDF(A) will probably continue getting a sort of bad rap in the HTML world, as prefixes are not associated with just making URIs, but with this terrible namespace tuple.
>
> For an archaeology of the relevant standards, check out the section "What Namespaces Do" of this paper. While the paper is focussed on why namespace documents are a mess, the relevant information is in that section and extensively referenced, with examples:
>
> http://xml.coverpages.org/HHalpinXMLVS-Extreme.html

Ahh, thanks for explaining that one Harry, most helpful :)

Best, Nathan
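The (namespace, localname) tuple can be observed with Python's standard `xml.etree.ElementTree`, which reports qualified names in `{namespace}localname` form rather than as a concatenated URI. The sketch below parses the example from the Namespaces in XML recommendation; the helper name `split_qname` is hypothetical and only there for illustration.

```python
import xml.etree.ElementTree as ET

DOC = (
    "<edi:price xmlns:edi='http://ecommerce.example.org/schema' "
    "units='Euro'>32.18</edi:price>"
)


def split_qname(tag):
    """Split ElementTree's '{namespace}local' form into a (namespace, local) pair."""
    if tag.startswith("{"):
        namespace, local = tag[1:].split("}", 1)
        return namespace, local
    return None, tag  # the name is in no namespace


elem = ET.fromstring(DOC)

# The element name is a pair; nothing joins the two halves into one URI.
element_name = split_qname(elem.tag)

# The unprefixed attribute 'units' is in no namespace at all, so there is
# no XML-level basis for building '...schema#Euro' from it.
attribute_names = [split_qname(name) for name in elem.attrib]

print(element_name)
print(attribute_names)
```

Any rule that turns such a pair into a single URI therefore has to come from a layer above XML, such as an RDF serialization.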
Re: URI Comparisons: RFC 2616 vs. RDF
Alan Ruttenberg wrote: On Wed, Jan 19, 2011 at 4:45 PM, Nathan nat...@webr3.org wrote: David Wood wrote: On Jan 19, 2011, at 10:59, Nathan wrote: ps: as an illustration of how engrained URI normalization is, I've capitalized the domain names in the to: and cc: fields, I do hope the mail still come through, and hope that you'll accept this email as being sent to you. Hopefully we'll also find this mail in the archives shortly at htTp://lists.W3.org/Archives/Public/public-lod/2011Jan/ - Personally I'd hope that any statements made using these URIs (asserted by man or machine) would remain valid regardless of the (incorrect?-)casing. Heh. OK, I'll bite. Domain names in email addressing are defined in IETF RFC 2822 (and its predecessor RFC 822), which defers the interpretation to RFC 1035 (Domain names - implementation and specification). RFC 1035 section 2.3.3 states that domain names in DNS, and therefore in (E)SMTP, are to be compared in a case-insensitive manner. As far as I know, the W3C specs do not so refer to RFC 1035. And I'll bite in the other direction, why not treat URIs as URIs? why go against both the RDF Specification [1] and the URI specification when they say /not/ to encode permitted US-ASCII characters (like ~ %7E)? why force case-sensitive matching on the scheme and domain on URIs matching the generic syntax when the specs say must be compared case insensitively? and so on and so forth. [AR] Which specs? The various URI/IRI specs and previous revisions of. http://www.w3.org/TR/REC-xml-names/#NSNameComparison URI references identifying namespaces .. In a namespace declaration, the URI reference is .. The URI references below are all different for the purposes of identifying namespaces .. The URI references below are also all different for the purposes of identifying namespaces .. So here is another spec that *explicitly* disagrees with the idea that URI normalization should be a built-in processing. 
As far as I can see, that's only for a URI reference used within a namespace, and does not govern usage or normalization when you join the URI reference up with the local name to make the full URI.

Out of interest, where is that process defined? I was looking for it the other day - for instance in the quoted specification we have the example:

<edi:price xmlns:edi='http://ecommerce.example.org/schema' units='Euro'>32.18</edi:price>

Where's the bit of the XML specification which says you join them up by concatenating 'http://ecommerce.example.org/schema' with #(?assumed?) and 'Euro' to get 'http://ecommerce.example.org/schema#Euro'?

And finally, this is why I specifically asked if the non-normalization of RDF URI References had XML Namespace heritage, which had then filtered down through OWL, SPARQL and RIF.

[AR] More to document, please: Which data is being junked and scrapped?

Will document, but essentially every statement made using a non-normalized URI when other statements are also being made about the same resource using normalized URIs - the two most common cases for this will be when people are using CMS systems and enter their domain name as uppercase in some admin, only to have that filter through to URIs in serialized RDF/RDFa, and where bugs in software have led to inconsistent URIs over time (for instance where % encoding has been fixed, or a :80 has been removed from a URI).

[AR] Hmm. Are you suggesting that the behavior of libraries and clients should have precedence over specification? My view is that one first looks to specifications, and then only if specifications are poor or do not speak to the issue do we look at existing behavior.
Yes I am, that specification should standardize the behaviour of libraries and clients - the level of normalization in URIs published, consumed or used by these tools is often determined by non sem web stack components, and the sem web components are blocked from normalizing these should-not-be-differing-URIs by the sem web specifications. [AR] I think there are many ways to lose in this scenario. For instance, if the server redirects then the base is the last in the chain of redirects. http://tools.ietf.org/html/rfc3986#page-29, 5.1.3. Base URI from the Retrieval URI. My conclusion - don't engineer this way. That would be my conclusion too, but as RDF(a) moves in to the realms of the CMS systems and out of the hands of the sem web community, it will be increasingly engineered this way, it's a very common pattern when working with (X)HTML (allows people to test locally or on dev servers without changing the content). Further, essentially all RDFa ever encountered by a browser has the casing on all URIs in href and src, and all these which are resolved, automatically normalized - so even if you set the base to htTp://EXAMPLE.org/ or use it in a URI, browser tools, extensions, and js based libraries will only ever see the normalized URIs (and thus be incompatible with the rest
Re: URI Comparisons: RFC 2616 vs. RDF
On 1/19/11 11:27 PM, Alan Ruttenberg wrote:

> On Wed, Jan 19, 2011 at 11:11 AM, Kingsley Idehen kide...@openlinksw.com wrote:
>
>> On 1/19/11 10:59 AM, Nathan wrote:
>>
>>> htTp://lists.W3.org/Archives/Public/public-lod/2011Jan/ - Personally I'd hope that any statements made using these URIs (asserted by man or machine) would remain valid regardless of the (incorrect?-)casing.
>>
>> Okay for Data Source Address Ref. (URL), no good for Entity (Data Item or Data Object) Name Ref., bar system specific handling via IFP property or owl:sameAs :-)
>
> Kingsley, same for you as Nathan. To what specification do you refer for the definitions and behavior of:
> - Data source address ref
> - Entity
> - Statement.
>
> -Alan

Alan,

My response is purely about managing Identifiers that are used as functional unambiguous Name or Address References. Not quoting a W3C spec. Basically, expressing a view based on my understanding of what's practical. A system (e.g. a database or client app.) can (should) make a decision about how it handles resolvable Identifiers when used as Name or Address references.

Kingsley

-- 
Regards, Kingsley Idehen
President & CEO, OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen
Re: URI Comparisons: RFC 2616 vs. RDF
On Wed, 2011-01-19 at 21:45 +, Nathan wrote:

> David Wood wrote:
>
>> On Jan 19, 2011, at 10:59, Nathan wrote:
>>
>>> ps: as an illustration of how engrained URI normalization is, I've capitalized the domain names in the to: and cc: fields, I do hope the mail still comes through, and hope that you'll accept this email as being sent to you. Hopefully we'll also find this mail in the archives shortly at htTp://lists.W3.org/Archives/Public/public-lod/2011Jan/ - Personally I'd hope that any statements made using these URIs (asserted by man or machine) would remain valid regardless of the (incorrect?-)casing.
>>
>> Heh. OK, I'll bite. Domain names in email addressing are defined in IETF RFC 2822 (and its predecessor RFC 822), which defers the interpretation to RFC 1035 (Domain names - implementation and specification). RFC 1035 section 2.3.3 states that domain names in DNS, and therefore in (E)SMTP, are to be compared in a case-insensitive manner. As far as I know, the W3C specs do not so refer to RFC 1035.
>
> And I'll bite in the other direction, why not treat URIs as URIs?

It seems to me the underlying question here is whether aliasing of URIs (whether they dereference to the same resource) should imply semantic equality (i.e. use as an identifier in a web logic language like RDF or OWL).

The position so far in RDF, OWL and RIF has been "no". As far as the specifications for those languages are concerned a URI is just a convenient spelling for an identifier and they require comparison of identifiers to be stable and context-independent. Those specs don't constrain what you get back from dereferencing some URI U to include statements about U.

The URI spec (rfc3986[1]) does allow this usage. In particular Section 6 Normalization and Comparison says: URI comparison is performed for some particular purpose.
Protocols or implementations that compare URIs for different purposes will often be subject to differing design trade-offs in regards to how much effort should be spent in reducing aliased identifiers. This section describes various methods that may be used to compare URIs, the trade-offs between them, and the types of applications that might use them. and We use the terms different and equivalent to describe the possible outcomes of such comparisons, but there are many application-dependent versions of equivalence. While RDF predates this spec it seems to me that the RDF usage remains consistent with it. The purpose of comparison in RDF is different from that of cache retrieval of web pages or message delivery of email. This quote also makes clear that there is no single definitive normalization. There are different levels of normalization possible depending on your needs. Earlier you pointed out that the place where the URI specs and RDF do collide is in resolving relative URIs into absolute URIs. Again rfc3986 does not preclude the RDF usage. Section 5.2.1 says: Normalization of the base URI, as described in Sections 6.2.2 and 6.2.3, is optional. So I claim that in terms of formal published specifications: (1) RDF, OWL and RIF do not require any normalization of URIs (beyond the character encoding level) and compare URIs by simple string comparison. (2) This usage is *not* precluded by the URI specs, at least by 3986 which sets the current framework for the application of scheme-specific specs. ** Now we turn to linked data ... As we've already mentioned :) there are no specs for linked data so we move onto more subjective grounds. The linked data convention is that dereferencing some URI U in your RDF document should return information about U, including further onward links. So if data set A spells a URI hTTp://example.com/foo but the data you get from dereferencing that URI talks only about http://example.com/foo then someone has a problem somewhere. 
The question is who, where and how to fix it.

It seems to me that this is primarily an issue with publishing, and a little about being sensible about how you pass on links. If I'm going to put up some linked data I should mint normalized URIs; I should use the same spelling of the URIs throughout my data; I'll make sure those URIs dereference and that the data that comes back is stable and useful. If someone else refers to my resources using an aliased URI (such as a different case for the protocol) and makes statements about those aliases then they have simply made a mistake.

To make sure that dereference returns what I expect, independent of aliasing, then I should publish data with explicit base URIs (or just absolute URIs). Publishing with relative URIs and no base is a recipe for having your data look different from different places. Just don't do it. No surprise there.

None of this requires us to force URI normalization into the heart of identifier comparison in RDF itself. It is not a necessary solution and it is not a sufficient one because there is no universal
Re: URI Comparisons: RFC 2616 vs. RDF
Hi Dave, Generally I agree, will address a few specific points in line (just to address them) then summarize my intended goals at the end (being the substance of the mail). Dave Reynolds wrote: The URI spec (rfc3986[1]) does allow this usage. In particular Section 6 Normalization and Comparison says: URI comparison is performed for some particular purpose. Protocols or implementations that compare URIs for different purposes will often be subject to differing design trade-offs in regards to how much effort should be spent in reducing aliased identifiers. This section describes various methods that may be used to compare URIs, the trade-offs between them, and the types of applications that might use them. and We use the terms different and equivalent to describe the possible outcomes of such comparisons, but there are many application-dependent versions of equivalence. While RDF predates this spec it seems to me that the RDF usage remains consistent with it. The purpose of comparison in RDF is different from that of cache retrieval of web pages or message delivery of email. Indeed, I also read though: For all URIs, the hexadecimal digits within a percent-encoding triplet (e.g., %3a versus %3A) are case-insensitive and therefore should be normalized to use uppercase letters for the digits A-F. When a URI uses components of the generic syntax, the component syntax equivalence rules always apply; namely, that the scheme and host are case-insensitive and therefore should be normalized to lowercase... - http://tools.ietf.org/html/rfc3986#section-6.2.2.1 And took the For all and always to literally mean for all and always. Unsure where this leaves things, and which takes precedence. This quote also makes clear that there is no single definitive normalization. There are different levels of normalization possible depending on your needs. 
agree So I claim that in terms of formal published specifications: (1) RDF, OWL and RIF do not require any normalization of URIs (beyond the character encoding level) and compare URIs by simple string comparison. One potential issue on the % encoding, clarified further down. (2) This usage is *not* precluded by the URI specs, at least by 3986 which sets the current framework for the application of scheme-specific specs. Not a 100% sure but tempted to agree with you, would make sense not to preclude it. As we've already mentioned :) there are no specs for linked data so we move onto more subjective grounds. Would be nice to get some specs at some point... The linked data convention is that dereferencing some URI U in your RDF document should return information about U, including further onward links. So if data set A spells a URI hTTp://example.com/foo but the data you get from dereferencing that URI talks only about http://example.com/foo then someone has a problem somewhere. The question is who, where and how to fix it. agree, good way of putting it. against both the RDF Specification [1] and the URI specification when they say /not/ to encode permitted US-ASCII characters (like ~ %7E)? Where did that example come from? The encoding consists of... %-escaping octets that do not correspond to permitted US-ASCII characters. - http://www.w3.org/TR/rdf-concepts/#section-Graph-URIref For consistency, percent-encoded octets in the ranges of ALPHA (%41-%5A and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E), underscore (%5F), or tilde (%7E) should not be created by URI producers and, when found in a URI, should be decoded to their corresponding unreserved characters by URI normalizers. - http://tools.ietf.org/html/rfc3986#section-2.3 I read those quotes as saying do not encode permitted US-ASCII characters in RDF URI References. At what point have we suggested doing that? 
As above why force case-sensitive matching on the scheme and domain on URIs matching the generic syntax when the specs say must be compared case insensitively? No, the specs do not say that, see above. See for all and always quote earlier on. So use normalized URIs in the first place. ... RDF/OWL/RIF aren't designed the way they are because someone thought it would be a good idea to allow such things to be used side by side or because they *want* people to use denormalized URIs. ... The point is that there is no single, simple, universal (i.e. across all schemes) normalization algorithm that could be used. The current approach gives stable, well-defined behaviour which doesn't change as people invent new URI schemes. The RDF serializations give you enough control to enable you to be certain about what URI you are talking about. Job done. Okay, I agree, and I'm really not looking to create a lot of work here, the general gist of what I'm hoping for is along the lines of: RDF Publishers MUST perform Case Normalization and Percent-Encoding Normalization on all
Re: URI Comparisons: RFC 2616 vs. RDF
On Thu, 2011-01-20 at 13:08 +, Dave Reynolds wrote:

> [ . . . ]
>
> It seems to me that this is primarily an issue with publishing, and a little about being sensible about how you pass on links. If I'm going to put up some linked data I should mint normalized URIs; I should use the same spelling of the URIs throughout my data; I'll make sure those URIs dereference and that the data that comes back is stable and useful. If someone else refers to my resources using an aliased URI (such as a different case for the protocol) and makes statements about those aliases then they have simply made a mistake.
>
> To make sure that dereference returns what I expect, independent of aliasing, then I should publish data with explicit base URIs (or just absolute URIs). Publishing with relative URIs and no base is a recipe for having your data look different from different places. Just don't do it.

This advice sounds like an excellent candidate for publication in a best practices document. And if it is merely best practice guidance, perhaps that *is* something that the new RDF working group could address.

-- 
David Booth, Ph.D. http://dbooth.org/

Opinions expressed herein are those of the author and do not necessarily reflect those of his employer.
Re: URI Comparisons: RFC 2616 vs. RDF
David Booth wrote:

> On Thu, 2011-01-20 at 13:08 +, Dave Reynolds wrote:
>
>> [ . . . ]
>>
>> It seems to me that this is primarily an issue with publishing, and a little about being sensible about how you pass on links. If I'm going to put up some linked data I should mint normalized URIs; I should use the same spelling of the URIs throughout my data; I'll make sure those URIs dereference and that the data that comes back is stable and useful. If someone else refers to my resources using an aliased URI (such as a different case for the protocol) and makes statements about those aliases then they have simply made a mistake.
>>
>> To make sure that dereference returns what I expect, independent of aliasing, then I should publish data with explicit base URIs (or just absolute URIs). Publishing with relative URIs and no base is a recipe for having your data look different from different places. Just don't do it.
>
> This advice sounds like an excellent candidate for publication in a best practices document. And if it is merely best practice guidance, perhaps that *is* something that the new RDF working group could address.

+1 from me, address at the publishing phase, allow at the consuming phase, keep comparison simple.
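Why relative URIs without an explicit base make data "look different from different places" can be illustrated with `urllib.parse.urljoin` from the Python standard library. This is only a sketch: the two retrieval URIs and the relative references below are made-up examples, not taken from any data set discussed in the thread.

```python
from urllib.parse import urljoin

# The same relative references, as they might appear in a published document.
RELATIVE_REFS = ["#me", "other"]

# Two hypothetical places the same document could be retrieved from.
PRODUCTION = "http://example.org/data/doc"
DEV_COPY = "http://dev.example.org/data/doc"


def resolve_all(base, refs):
    """Resolve each relative reference against the given base URI."""
    return [urljoin(base, ref) for ref in refs]


from_production = resolve_all(PRODUCTION, RELATIVE_REFS)
from_dev_copy = resolve_all(DEV_COPY, RELATIVE_REFS)

print(from_production)
print(from_dev_copy)

# With no declared base, the retrieval URI is the base, so the two copies
# end up talking about different identifiers under plain string comparison.
print(set(from_production) & set(from_dev_copy))
```

Declaring an explicit base, or publishing absolute URIs, removes the dependence on where the document happened to be fetched from.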
Re: URI Comparisons: RFC 2616 vs. RDF
* [2011-01-20 14:29:35 +] Nathan nat...@webr3.org wrote:

] RDF Publishers MUST perform Case Normalization and Percent-Encoding
] Normalization on all URIs prior to publishing. When using relative URIs
] publishers SHOULD include a well defined base using a serialization
] specific mechanism. Publishers are advised to perform additional
] normalization steps as specified by URI (RFC 3986) where possible.
]
] RDF Consumers MAY normalize URIs they encounter and SHOULD perform
] Case Normalization and Percent-Encoding Normalization.
]
] Two RDF URIs are equal if and only if they compare as equal,
] character by character, as Unicode strings.
]
] For many reasons it would be good to solve this at the publishing phase,
] allow normalization at the consuming phase (can't be precluded as
] intermediary components may normalize), and keep simple case sensitive
] string comparison throughout the stack and specs (so implementations
] remain simple and fast.)
]
] Does anybody find the above disagreeable?

Sounds about right to me, but what about port numbers, http://example.org/ vs http://example.org:80/?

-w
-- 
William Waites mailto:w...@styx.org
http://eris.okfn.org/ww/ sip:w...@styx.org
F4B3 39BF E775 CF42 0BAB 3DF0 BE40 A6DF B06F FD45
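The port question falls under what RFC 3986 calls scheme-based normalization, which goes beyond the case and percent-encoding steps in the proposed text. A rough Python sketch of that extra step follows; the `DEFAULT_PORTS` table and the function name `drop_default_port` are assumptions made for illustration, and the table would need an entry for every scheme a consumer knows about.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed table of default ports; unknown schemes are left untouched.
DEFAULT_PORTS = {"http": 80, "https": 443}


def drop_default_port(uri):
    """Hypothetical helper: remove an explicit port equal to the scheme default."""
    parts = urlsplit(uri)
    port = parts.port  # None when no port is given
    if port is not None and port == DEFAULT_PORTS.get(parts.scheme):
        # Strip the trailing ':<port>' from the authority component.
        netloc = parts.netloc.rsplit(":", 1)[0]
        return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
    return uri


plain = "http://example.org/"
with_port = "http://example.org:80/"

print(plain == with_port)                     # plain string comparison
print(plain == drop_default_port(with_port))  # after scheme-based normalization
print(drop_default_port("http://example.org:8080/"))
```

Because the table is scheme-specific, this step cannot be applied to URIs in schemes the consumer does not recognise, which is one reason it sits on a different rung of the normalization ladder than case folding.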
Re: URI Comparisons: RFC 2616 vs. RDF
Hi:

On 20.01.2011, at 15:40, Nathan wrote:

> David Booth wrote:
>
>> On Thu, 2011-01-20 at 13:08 +, Dave Reynolds wrote:
>>
>>> [ . . . ]
>>>
>>> To make sure that dereference returns what I expect, independent of aliasing, then I should publish data with explicit base URIs (or just absolute URIs). Publishing with relative URIs and no base is a recipe for having your data look different from different places. Just don't do it.
>>
>> This advice sounds like an excellent candidate for publication in a best practices document. And if it is merely best practice guidance, perhaps that *is* something that the new RDF working group could address.
>
> +1 from me, address at the publishing phase, allow at the consuming phase, keep comparison simple.

I am not sure whether you are also talking of RDFa, but in case you do, I would like to add the following:

Our experiences with helping about 2,000 sites with adding GoodRelations via our form-based tools show that

1. RDFa is in many cases the only viable way for people to publish RDF
2. They often cannot control and cannot even predict the exact URI of the page that will contain the markup (imagine uncool URIs loaded with parameters etc.)

In those scenarios, relative URIs are essential. We even recommend that people include an empty

<div rel="foaf:page" resource=""></div>

at the proper position in the nesting so that there will be a link between the data entity and the page that contains it.

Martin
Re: URI Comparisons: RFC 2616 vs. RDF
Martin Hepp wrote:

> On 20.01.2011, at 15:40, Nathan wrote:
>
>> David Booth wrote:
>>
>>> On Thu, 2011-01-20 at 13:08 +, Dave Reynolds wrote:
>>>
>>>> [ . . . ]
>>>>
>>>> To make sure that dereference returns what I expect, independent of aliasing, then I should publish data with explicit base URIs (or just absolute URIs). Publishing with relative URIs and no base is a recipe for having your data look different from different places. Just don't do it.
>>>
>>> This advice sounds like an excellent candidate for publication in a best practices document. And if it is merely best practice guidance, perhaps that *is* something that the new RDF working group could address.
>>
>> +1 from me, address at the publishing phase, allow at the consuming phase, keep comparison simple.
>
> I am not sure whether you are also talking of RDFa, but in case you do, I would like to add the following:

Hi Martin,

Yes (re RDFa), see: http://webr3.org/urinorm/2 - all the browsers do the normalization so you can't even get to the non-normalized URI.

In a browser you'll note that all the URIs get normalized automatically, in that it's impossible to programmatically access the correct casing. That's a problem.

If you run it through the RDFa distiller at w3.org [2] you'll find:

<htTp://WEBR3.org/urinorm/2> dc:creator <http://WEBR3.org/nathan#me> .
<http://WEBR3.org/urinorm/2#example> dc:title "URI Normalization Example 2" .

Note one of the URIs (the one which required relative path resolution) has the scheme normalised.

If you run it through check.rdfa.info you'll find that all the URIs are normalized. [3]

If you run it through sigma [4] you'll find everything has been normalized. You can also see an RDF view of this [5].

If you run it through URI Burner [6], you'll find that /some/ URIs have been normalized.

It's also worth noting that this caused all kinds of problems - I ended up having to create a new resource at this point w/ some RDF N3 to test URI Burner: http://webr3.org/urinorm/3 which led to the empty [7]. Then I figured I'd try [8], and if you click the creator ( htTp://WEBR3.org/nathan#me ), since in this case there's no normalization (note it was normalized in [6]), you get a 400 Bad Request [9].

And so on and so forth - far from ideal.

Best, Nathan

[1] http://www.rdfabout.com/demo/validator/ (normalizes all RDF URIs)
[2] http://www.w3.org/2007/08/pyRdfa/
[3] http://check.rdfa.info/check?url=http://webr3.org/urinorm/2&version=1.0
[4] http://sig.ma/search?q=http://webr3.org/urinorm/2
[5] http://sig.ma/entity/e6a2c8319bb3bf21f4b4639216f114a4.rdf#this
[6] http://linkeddata.uriburner.com/about/html/http/webr3.org/urinorm/2%01this
[7] http://linkeddata.uriburner.com/about/html/http/webr3.org/urinorm/3
[8] http://linkeddata.uriburner.com/about/html/htTp://WEBR3.org/urinorm/3
[9] http://linkeddata.uriburner.com/about/html/htTp/WEBR3.org/nathan%01me
Re: URI Comparisons: RFC 2616 vs. RDF
Hi Nathan, I largely agree but have a few quibbles :) On 20/01/2011 2:29 PM, Nathan wrote: Dave Reynolds wrote: The URI spec (rfc3986[1]) does allow this usage. In particular Section 6 Normalization and Comparison says: URI comparison is performed for some particular purpose. Protocols or implementations that compare URIs for different purposes will often be subject to differing design trade-offs in regards to how much effort should be spent in reducing aliased identifiers. This section describes various methods that may be used to compare URIs, the trade-offs between them, and the types of applications that might use them. and We use the terms different and equivalent to describe the possible outcomes of such comparisons, but there are many application-dependent versions of equivalence. While RDF predates this spec it seems to me that the RDF usage remains consistent with it. The purpose of comparison in RDF is different from that of cache retrieval of web pages or message delivery of email. Indeed, I also read though: For all URIs, the hexadecimal digits within a percent-encoding triplet (e.g., %3a versus %3A) are case-insensitive and therefore should be normalized to use uppercase letters for the digits A-F. When a URI uses components of the generic syntax, the component syntax equivalence rules always apply; namely, that the scheme and host are case-insensitive and therefore should be normalized to lowercase... - http://tools.ietf.org/html/rfc3986#section-6.2.2.1 And took the For all and always to literally mean for all and always. Those quotes come from section (6.2.2) describing normalization but the earlier quote is from the start of section 6 saying that choice of normalization is application dependent. I interpret the two together as *if* you are normalizing then always ...blah That was certainly the RIF position where we explicitly said that sections 6.2.2 and 6.2.3 of rfc3986 were not applicable. 
against both the RDF Specification [1] and the URI specification when they say /not/ to encode permitted US-ASCII characters (like ~ %7E)? Where did that example come from? The encoding consists of... %-escaping octets that do not correspond to permitted US-ASCII characters. - http://www.w3.org/TR/rdf-concepts/#section-Graph-URIref For consistency, percent-encoded octets in the ranges of ALPHA (%41-%5A and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E), underscore (%5F), or tilde (%7E) should not be created by URI producers and, when found in a URI, should be decoded to their corresponding unreserved characters by URI normalizers. - http://tools.ietf.org/html/rfc3986#section-2.3 I read those quotes as saying do not encode permitted US-ASCII characters in RDF URI References. At what point have we suggested doing that? As above Sorry, I didn't mean to dispute that you shouldn't %-encode ~, I was wondering where the suggestion that you should do so came from. I believe there are some corner cases, such as the handling of spaces, which differ between the RDF spec and the IRI spec. This was down to timing. The RDF Core WG was doing its best to anticipate what the IRI spec would look like but couldn't wait until that was finalized. Resolving any such small discrepancies between that anticipation and the actual IRI specs is something I believe to be in scope for the proposed new RDF WG. So use normalized URIs in the first place. ... RDF/OWL/RIF aren't designed the way they are because someone thought it would be a good idea to allow such things to be used side by side or because they *want* people to use denormalized URIs. ... The point is that there is no single, simple, universal (i.e. across all schemes) normalization algorithm that could be used. The current approach gives stable, well-defined behaviour which doesn't change as people invent new URI schemes. 
The RDF serializations give you enough control to enable you to be certain about what URI you are talking about. Job done. Okay, I agree, and I'm really not looking to create a lot of work here, the general gist of what I'm hoping for is along the lines of: RDF Publishers MUST perform Case Normalization and Percent-Encoding Normalization on all URIs prior to publishing. When using relative URIs publishers SHOULD include a well defined base using a serialization specific mechanism. Publishers are advised to perform additional normalization steps as specified by URI (RFC 3986) where possible. RDF Consumers MAY normalize URIs they encounter and SHOULD perform Case Normalization and Percent-Encoding Normalization. Two RDF URIs are equal if and only if they compare as equal, character by character, as Unicode strings. I sort of OK with that but ... Terms like RDF Publisher and RDF Consumer need to be defined in order to make formal statements like these. The RDF/OWL/RIF specs are careful to define what sort of processors are subject to conformance statements and I don't think RDF
Standardizing linked data - was Re: URI Comparisons: RFC 2616 vs. RDF
Dave Reynolds wrote: Okay, I agree, and I'm really not looking to create a lot of work here, the general gist of what I'm hoping for is along the lines of: RDF Publishers MUST perform Case Normalization and Percent-Encoding Normalization on all URIs prior to publishing. When using relative URIs publishers SHOULD include a well defined base using a serialization specific mechanism. Publishers are advised to perform additional normalization steps as specified by URI (RFC 3986) where possible. RDF Consumers MAY normalize URIs they encounter and SHOULD perform Case Normalization and Percent-Encoding Normalization. Two RDF URIs are equal if and only if they compare as equal, character by character, as Unicode strings. I sort of OK with that but ... Terms like RDF Publisher and RDF Consumer need to be defined in order to make formal statements like these. The RDF/OWL/RIF specs are careful to define what sort of processors are subject to conformance statements and I don't think RDF Publisher is a conformance point for the existing specs. This may sound like nit-picking that's life with specifications. You need to be clear how the last para about RDF URIs relates to notions like RDF Consumer. I wonder whether you might want to instead define notions of Linked Data Publisher and Linked Data Consumer to which these MUST/MAY/SHOULD conformance statements apply. That way it is clear that a component such as an RDF store or RDF parser is correct in following the existing RDF specs and not doing any of these transformations but that in order to construct a Linked Data Consumer/Publisher some other component can be introduced to perform the normalizations. Linked Data as a set of constraints and conventions layered on top of the RDF/OWL specs. Fully agree, had the same conversation with DanC this afternoon and he too immediately suggested changing RDF Publisher/Consumer to Linked Data Publisher/Consumer. 
Also ties in with earlier comments about standardizing Linked Data, however it's done, or worded, my only care here is that it positively impacts the current situation, and doesn't negatively impact anybody else. The specific point on the normalization ladder would have to defined, of course, and you would need to define how to handle schemes unknown to the consumer. All this presupposes some work to formalize and specify linked data. Is there anything like that planned? In some ways Linked Data is an engineering experiment and benefits from that freedom to experiment. On the other hand interoperability eventually needs clear specifications. Unsure, but I'll also ask the question, is there anything planned? I'd certainly +1 standardization and do anything I could to help the process along. For many reasons it would be good to solve this at the publishing phase, allow normalization at the consuming phase (can't be precluded as intermediary components may normalize), and keep simple case sensitive string comparison throughout the stack and specs (so implementations remain simple and fast.) Agreed. cool, thanks again Dave, Nathan
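The two steps proposed above for publishers, Case Normalization and Percent-Encoding Normalization, can be sketched roughly as follows. This is only an illustration of RFC 3986 sections 6.2.2.1 and 6.2.2.2, not text from any RDF or Linked Data spec; the function name normalize_uri is made up for this example, and the scheme/authority regex assumes a URI in the generic syntax.

```python
import re

# Unreserved characters per RFC 3986 section 2.3.
_UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")
# Scheme plus optional authority at the start of the URI.
_HEAD = re.compile(r"^([A-Za-z][A-Za-z0-9+.\-]*:)(//[^/?#]*)?")


def _fix_pct(match):
    """Decode escapes of unreserved characters, upper-case the hex of the rest."""
    char = chr(int(match.group(1), 16))
    if char in _UNRESERVED:
        return char
    return "%" + match.group(1).upper()


def _lower_head(match):
    """Lower-case the scheme and the host; leave any userinfo untouched."""
    scheme = match.group(1).lower()
    authority = match.group(2) or ""
    userinfo, at, hostport = authority.rpartition("@")
    return scheme + userinfo + at + hostport.lower()


def normalize_uri(uri):
    """Hypothetical helper: case and percent-encoding normalization only."""
    uri = _HEAD.sub(_lower_head, uri, count=1)
    return _PCT.sub(_fix_pct, uri)


print(normalize_uri("htTp://EXAMPLE.org/%7efoo"))
print(normalize_uri("http://lod.taxonconcept.org/ses/v6N7p.html"))
```

Path, query and fragment keep their case, so the two taxonconcept URIs from earlier in the thread stay distinct. Port removal, dot-segment removal and scheme-specific rules are deliberately left out, since those are further up the ladder.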
Re: Standardizing linked data - was Re: URI Comparisons: RFC 2616 vs. RDF
Nathan wrote: Dave Reynolds wrote: All this presupposes some work to formalize and specify linked data. Is there anything like that planned? In some ways Linked Data is an engineering experiment and benefits from that freedom to experiment. On the other hand interoperability eventually needs clear specifications. Unsure, but I'll also ask the question, is there anything planned? I'd certainly +1 standardization and do anything I could to help the process along. Or perhaps an IG/XG follow-up to the SWEO, taking into account Read Write Web of Data, hopefully with some protocol or best practice report giving a migration path to standardization? There are certainly plenty of other groups to take into account and consider in all of this, like the WebID XG. Best, Nathan
Re: URI Comparisons: RFC 2616 vs. RDF
On Thu, Jan 20, 2011 at 5:15 AM, Nathan nat...@webr3.org wrote: As far as I can see, that's only for a URI reference used within a namespace, and does not govern usage or normalization when you join the URI reference up with the local name to make the full URI. Out of interest, where is that process defined? I was looking for it the other day - for instance in the quoted specification we have the example: edi:price xmlns:edi='http://ecommerce.example.org/schema' units='Euro'32.18/edi:price Where's the bit of the XML specification which says you join them up by concatenating 'http://ecommerce.example.org/schema' with #(?assumed?) and 'Euro' to get 'http://ecommerce.example.org/schema#Euro'? My understanding is that this is governed by the definition of qnames. As I understand things, the concatenation you write would happen only if the attribute was defined in the schema to be an xsi:type http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/structures.html#xsi_type, and without the #. The only case where a # would be added is when rdf:id or xml:id is used. And finally, this is why I specifically asked if the non-normalization of RDF URI References had XML Namespace heritage, which had then filtered down through OWL, SPARQL and RIF. I don't believe so. I believe the genesis are the reasons that I discussed earlier - the difficulty of actually implementing it combined with the indeterminacy. But I would be glad if someone else has better information and can either confirm or deny this. -Alan
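To see what an XML processor actually does with the edi:price example quoted above, here is a small sketch using Python's standard xml.etree.ElementTree. The closing tags and angle brackets of the element are reconstructed here, since the archive stripped them; treat the exact markup as an assumption.

```python
import xml.etree.ElementTree as ET

# The edi:price example from the Namespaces in XML spec, markup reconstructed.
doc = (
    "<edi:price xmlns:edi='http://ecommerce.example.org/schema' "
    "units='Euro'>32.18</edi:price>"
)
elem = ET.fromstring(doc)

# ElementTree keeps the name as a (namespace, local name) pair,
# written in Clark notation: {namespace}localname.
print(elem.tag)

# The unprefixed attribute is in no namespace, and its value is a plain string.
print(elem.attrib)
print(elem.text)
```

Nothing here concatenates the namespace name with 'Euro' or with the local name to form a single URI. That joining step belongs to RDF serializations such as RDF/XML, not to the XML namespace machinery itself.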
Re: URI Comparisons: RFC 2616 vs. RDF
On Thu, Jan 20, 2011 at 11:15 AM, Nathan nat...@webr3.org wrote: Alan Ruttenberg wrote: On Wed, Jan 19, 2011 at 4:45 PM, Nathan nat...@webr3.org wrote: David Wood wrote: On Jan 19, 2011, at 10:59, Nathan wrote: ps: as an illustration of how engrained URI normalization is, I've capitalized the domain names in the to: and cc: fields, I do hope the mail still come through, and hope that you'll accept this email as being sent to you. Hopefully we'll also find this mail in the archives shortly at htTp://lists.W3.org/Archives/Public/public-lod/2011Jan/ - Personally I'd hope that any statements made using these URIs (asserted by man or machine) would remain valid regardless of the (incorrect?-)casing. Heh. OK, I'll bite. Domain names in email addressing are defined in IETF RFC 2822 (and its predecessor RFC 822), which defers the interpretation to RFC 1035 (Domain names - implementation and specification). RFC 1035 section 2.3.3 states that domain names in DNS, and therefore in (E)SMTP, are to be compared in a case-insensitive manner. As far as I know, the W3C specs do not so refer to RFC 1035. And I'll bite in the other direction, why not treat URIs as URIs? why go against both the RDF Specification [1] and the URI specification when they say /not/ to encode permitted US-ASCII characters (like ~ %7E)? why force case-sensitive matching on the scheme and domain on URIs matching the generic syntax when the specs say must be compared case insensitively? and so on and so forth. [AR] Which specs? The various URI/IRI specs and previous revisions of. http://www.w3.org/TR/REC-xml-names/#NSNameComparison URI references identifying namespaces .. In a namespace declaration, the URI reference is .. The URI references below are all different for the purposes of identifying namespaces .. The URI references below are also all different for the purposes of identifying namespaces .. 
So here is another spec that *explicitly* disagrees with the idea that URI normalization should be a built-in processing. As far as I can see, that's only for a URI reference used within a namespace, and does not govern usage or normalization when you join the URI reference up with the local name to make the full URI. Out of interest, where is that process defined? I was looking for it the other day - for instance in the quoted specification we have the example: edi:price xmlns:edi='http://ecommerce.example.org/schema' units='Euro'32.18/edi:price Where's the bit of the XML specification which says you join them up by concatenating 'http://ecommerce.example.org/schema' with #(?assumed?) and 'Euro' to get 'http://ecommerce.example.org/schema#Euro'? Actually you don't. A namespace is just that - a tuple (namespace, localname) in XML. That's why namespaces in XML are for all intents and purposes broken and why, to a large extent, Web browser developers in HTML stopped using them and hate implementing them in the DOM, and so refuse to have them in HTML5. And that's one reason RDF(A) will probably continue getting a sort of bad rap in the HTML world, as prefixes are not associated with just making URIs, but with this terrible namespace tuple. For an archeology of the relevant standards, check out Section What Namespaces Do of this paper. While the paper is focussed on why namespace documents are a mess, the relevant information is in that section and extensively referenced, with examples: http://xml.coverpages.org/HHalpinXMLVS-Extreme.html And finally, this is why I specifically asked if the non-normalization of RDF URI References had XML Namespace heritage, which had then filtered down through OWL, SPARQL and RIF. Indeed, they should be normalized in a sane manner across all Semantic Web specs, and dependencies on XML Namespaces should obviously be dropped IMHO. [AR] More to document, please: Which data is being junked and scrapped? 
will document, but essentially every statement made using a non normalized URI when other statements are also being made about the same resource using normalized URIs - the two most common cases for this will be when people are using CMS systems and enter their domain name as uppercase in some admin, only to have that filter through to URIs in serialized RDF/RDFa, and where bugs in software have led to inconsistent URIs over time (for instance where % encoding has been fixed, or a :80 has been removed from a URI). [AR] Hmm. Are you suggesting that the behavior of libraries and clients should have precedence over specification? My view is that one first looks to specifications, and then only if specifications are poor or do not speak to the issue do we look at existing behavior. Which is the case with namespaces and URI normalization :) Yes I am, that specification should standardize the behaviour of libraries and clients - the level of normalization in URIs published, consumed or used by these tools is often
Re: URI Comparisons: RFC 2616 vs. RDF
On 19/01/2011 3:55 AM, Alan Ruttenberg wrote: The information on how to fully determine equivalence according to the URI spec is distributed across a wide and growing number of different specifications (because it is schema dependent) and could, in principle, change over time. Because of the distributed nature of the information it is not feasible to fully implement these rules. Optionally implementing these rules (each implementor choosing where on the ladder they want to be) would mean that documents written in RDF (and derivative languages) would be interpreted differently by different implementations, which is an unacceptable feature of languages designed for unambiguous communication. The fact that the set of rules is growing and possibly changing would lead to a similar situation - documents that meant one thing at one time could mean different things later, which is also unacceptable, for the same reason. Well put, I meant to point out the implications of scheme-dependence and you've covered it very clearly. David (Wood) clarifies (surprisingly to me as well) that the issue of normalization could be addressed by the working group. I expect, however, that any proposed change would quickly be determined to be counter to the instructions given in the charter on Compatibility and Deployment Expectation, and if not, would be rejected after justified objections on this basis from reviewers outside the working group. +1 Dave
Re: URI Comparisons: RFC 2616 vs. RDF
Dave Reynolds wrote: On 19/01/2011 3:55 AM, Alan Ruttenberg wrote: The information on how to fully determine equivalence according to the URI spec is distributed across a wide and growing number of different specifications (because it is schema dependent) and could, in principle, change over time. Because of the distributed nature of the information it is not feasible to fully implement these rules. Optionally implementing these rules (each implementor choosing where on the ladder they want to be) would mean that documents written in RDF (and derivative languages) would be interpreted differently by different implementations, which is an unacceptable feature of languages designed for unambiguous communication. The fact that the set of rules is growing and possibly changing would lead to a similar situation - documents that meant one thing at one time could mean different things later, which is also unacceptable, for the same reason. Well put, I meant to point out the implications of scheme-dependence and you've covered it very clearly. Whilst I share the same end goal, I have to stress that *several important factors have been omitted*. The semantic web specifications are not the only ones which affect interoperability and compatibility with regard to URIs. Many (most) RDF serializations include the use of relative URIs, are affected by base mechanisms which are defined by the URIs RFC, dependent on the protocol, and by base mechanisms provided by host serialization languages, and each of the respective implementations thereof. This covers everything from implementations of the http protocol on clients, servers and intermediaries, through to implementations of the DOM in XML tooling, HTML tooling and the major browsers. It also covers every potential component which provides URI support, from open source libraries and classes through embedded support in black box applications. 
Every single one of the aforementioned is free to (silently) implement any of the URI normalization techniques in the URI/IRI RFCs. Each implementer of these specifications chooses where on the ladder they want to be, and that decision often determines the URIs seen by implementations of the semantic web specifications. These factors cannot be ignored, and they are the factors which the RDF specification and semantic web specifications must strive to be compatible with, and to normalize the actions of. Every additional step on the ladder added as a requirement to the RDF specification is a step closer to interoperability and compatibility. David (Wood) clarifies (surprisingly to me as well) that the issue of normalization could be addressed by the working group. I expect, however, that any proposed change would quickly be determined to be counter to the instructions given in the charter on Compatibility and Deployment Expectation, and if not, would be rejected after justified objections on this basis from reviewers outside the working group. +1 As per the above, I'd expect the polar opposite. +1 to compatibility (with the real, deployed, web - the one we all use) Best, Nathan
Re: URI Comparisons: RFC 2616 vs. RDF
Nathan, If you are going to make claims about the effect of other specifications on RDF, could you please include pointers to the parts of specifications that you are referring to, ideally with illustrative examples of the problems you are? Absent that it is too difficult to evaluate your claims. The conversations on such topics too often devolve into serial opinion dumping. If this is to be at all productive we need to be as precise as possible. -Alan
Re: URI Comparisons: RFC 2616 vs. RDF
Hi Alan, Alan Ruttenberg wrote: Nathan, If you are going to make claims about the effect of other specifications on RDF, could you please include pointers to the parts of specifications that you are referring to, ideally with illustrative examples of the problems you are? Absent that it is too difficult to evaluate your claims. The conversations on such topics too often devolve into serial opinion dumping. If this is to be at all productive we need to be as precise as possible. Good idea :) I'll create a new page on the wiki and add some examples over the next few days, then reply with a pointer later in the week. ps: as an illustration of how engrained URI normalization is, I've capitalized the domain names in the to: and cc: fields, I do hope the mail still come through, and hope that you'll accept this email as being sent to you. Hopefully we'll also find this mail in the archives shortly at htTp://lists.W3.org/Archives/Public/public-lod/2011Jan/ - Personally I'd hope that any statements made using these URIs (asserted by man or machine) would remain valid regardless of the (incorrect?-)casing. Best, Nathan
Re: URI Comparisons: RFC 2616 vs. RDF
On 1/19/11 10:59 AM, Nathan wrote: htTp://lists.W3.org/Archives/Public/public-lod/2011Jan/ - Personally I'd hope that any statements made using these URIs (asserted by man or machine) would remain valid regardless of the (incorrect?-)casing. Okay for Data Source Address Ref. (URL), no good for Entity (Data Item or Data Object) Name Ref., bar system specific handling via IFP property or owl:sameAs :-) -- Regards, Kingsley Idehen President CEO OpenLink Software Web: http://www.openlinksw.com Weblog: http://www.openlinksw.com/blog/~kidehen Twitter/Identi.ca: kidehen
Re: URI Comparisons: RFC 2616 vs. RDF
On 1/19/11 16:59 , Nathan wrote: Hi Alan, Alan Ruttenberg wrote: Nathan, If you are going to make claims about the effect of other specifications on RDF, could you please include pointers to the parts of specifications that you are referring to, ideally with illustrative examples of the problems you are? Absent that it is too difficult to evaluate your claims. The conversations on such topics too often devolve into serial opinion dumping. If this is to be at all productive we need to be as precise as possible. Good idea :) I'll create a new page on the wiki and add some examples over the next few days, then reply with a pointer later in the week. +1! ps: as an illustration of how engrained URI normalization is, I've capitalized the domain names in the to: and cc: fields, I do hope the mail still come through, and hope that you'll accept this email as being sent to you. Hopefully we'll also find this mail in the archives shortly at htTp://lists.W3.org/Archives/Public/public-lod/2011Jan/ - Personally I'd hope that any statements made using these URIs (asserted by man or machine) would remain valid regardless of the (incorrect?-)casing. Best, Nathan Yrjänä -- Mr. Yrjana Rankka| gh...@openlinksw.com Developer, Virtuoso Team | http://www.openlinksw.com | Making Technology Work For You
Re: URI Comparisons: RFC 2616 vs. RDF
* [2011-01-19 11:11:20 -0500] Kingsley Idehen kide...@openlinksw.com wrote:
] On 1/19/11 10:59 AM, Nathan wrote:
] htTp://lists.W3.org/Archives/Public/public-lod/2011Jan/ - Personally
] I'd hope that any statements made using these URIs (asserted by man or
] machine) would remain valid regardless of the (incorrect?-)casing.
]
] Okay for Data Source Address Ref. (URL), no good for Entity (Data Item
] or Data Object) Name Ref., bar system specific handling via IFP property
] or owl:sameAs :-)

FWIW I've just added a FuXi builtin for the curate tool [1] that does URI comparisons using ll.uri [2] (deliberately pushing the choice of place on the ladder into a library). It is used like this:

    @prefix curate: <http://eris.okfn.org/ww/2010/12/curate#> .
    { ?s1 ?p1 ?o1 . ?s2 ?p2 ?o2 . ?s1 curate:cmpURI ?s2 } => { ?s1 = ?s2 } .

And results in statements like this:

    <HTTP://example.org:80/> = <HTTP://example.org:80/>, <http://EXAMPLE.ORG/>, <http://example.org/> .
    <http://EXAMPLE.ORG/> = <HTTP://example.org:80/>, <http://EXAMPLE.ORG/>, <http://example.org/> .
    <http://example.org/> = <HTTP://example.org:80/>, <http://EXAMPLE.ORG/>, <http://example.org/> .

Cheers, -w

[1] https://bitbucket.org/okfn/curate/src/1f6ba3c360c3/curate/builtins.py#cl-9
[2] http://www.livinglogic.de/Python/url/Howto.html

-- William Waites mailto:w...@styx.org http://eris.okfn.org/ww/ sip:w...@styx.org F4B3 39BF E775 CF42 0BAB 3DF0 BE40 A6DF B06F FD45
Re: URI Comparisons: RFC 2616 vs. RDF
On Jan 19, 2011, at 10:59, Nathan wrote: Hi Alan, Alan Ruttenberg wrote: Nathan, If you are going to make claims about the effect of other specifications on RDF, could you please include pointers to the parts of specifications that you are referring to, ideally with illustrative examples of the problems you are? Absent that it is too difficult to evaluate your claims. The conversations on such topics too often devolve into serial opinion dumping. If this is to be at all productive we need to be as precise as possible. Good idea :) I'll create a new page on the wiki and add some examples over the next few days, then reply with a pointer later in the week. ps: as an illustration of how engrained URI normalization is, I've capitalized the domain names in the to: and cc: fields, I do hope the mail still come through, and hope that you'll accept this email as being sent to you. Hopefully we'll also find this mail in the archives shortly at htTp://lists.W3.org/Archives/Public/public-lod/2011Jan/ - Personally I'd hope that any statements made using these URIs (asserted by man or machine) would remain valid regardless of the (incorrect?-)casing. Heh. OK, I'll bite. Domain names in email addressing are defined in IETF RFC 2822 (and its predecessor RFC 822), which defers the interpretation to RFC 1035 (Domain names - implementation and specification). RFC 1035 section 2.3.3 states that domain names in DNS, and therefore in (E)SMTP, are to be compared in a case-insensitive manner. As far as I know, the W3C specs do not so refer to RFC 1035. :) Regards, Dave Best, Nathan
Re: URI Comparisons: RFC 2616 vs. RDF
David Wood wrote: On Jan 19, 2011, at 10:59, Nathan wrote: ps: as an illustration of how engrained URI normalization is, I've capitalized the domain names in the to: and cc: fields, I do hope the mail still come through, and hope that you'll accept this email as being sent to you. Hopefully we'll also find this mail in the archives shortly at htTp://lists.W3.org/Archives/Public/public-lod/2011Jan/ - Personally I'd hope that any statements made using these URIs (asserted by man or machine) would remain valid regardless of the (incorrect?-)casing. Heh. OK, I'll bite. Domain names in email addressing are defined in IETF RFC 2822 (and its predecessor RFC 822), which defers the interpretation to RFC 1035 (Domain names - implementation and specification). RFC 1035 section 2.3.3 states that domain names in DNS, and therefore in (E)SMTP, are to be compared in a case-insensitive manner. As far as I know, the W3C specs do not so refer to RFC 1035. And I'll bite in the other direction, why not treat URIs as URIs? why go against both the RDF Specification [1] and the URI specification when they say /not/ to encode permitted US-ASCII characters (like ~ %7E)? why force case-sensitive matching on the scheme and domain on URIs matching the generic syntax when the specs say must be compared case insensitively? and so on and so forth. I have to be honest, I can't see what good this is doing anybody, in fact it's the complete opposite scenario, where data is being junked and scrapped because we are ignoring the specifications which are designed to enable interoperability and limit unexpected behaviour. I'm currently preparing a list of errors I'm finding in RDF, RDFa and linked data tooling to do with this, and I have to admit even I'm surprised at the sheer number of tools which are affected. 
Additionally there's a very nasty, and common, use case which I can't test fully, so would appreciate people taking the time to check their own libraries/clients, as follows: If you find some data with the following setup (example): @base htTp://EXAMPLE.org/foo/bar . #t x:rel ../baz . and then you follow your nose to htTp://EXAMPLE.org/baz, will you find any triples about it? (problem 1) and if there's no base on the second resource, and it uses relative URIs, then the base you'll be using is htTp://EXAMPLE.org/baz, and thus, you'll effectively create a new set of statements which the author never wrote, or intended (problem 2). In other words, in this scenario, no matter what you do you're either going to get no data (even though it's there) or get a set of statements which were never said by the author (because the casing is different). Further, essentially all RDFa ever encountered by a browser has the casing on all URIs in href and src, and all these which are resolved, automatically normalized - so even if you set the base to htTp://EXAMPLE.org/ or use it in a URI, browser tools, extensions, and js based libraries will only ever see the normalized URIs (and thus be incompatible with the rest of the RDF world). I'll continue on getting the specific examples for current RDF tooling and resources and get it on the wiki, but I'll say now that almost every tool I've encountered so far does it wrong in inconsistent non-compatible ways. Finally, I'll ask again, if anybody has any use case which benefits from htTp://EXAMPLE.org/%7efoo and http://example.org/~foo being classed as different RDF URIs, I'd love to hear it. [1] The encoding consists of: ... 2. %-escaping octets that do not correspond to permitted US-ASCII characters. - http://www.w3.org/TR/rdf-concepts/#section-Graph-URIref Best, Nathan
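The @base scenario above is easy to try against any URI library. As one data point, here is a sketch using Python's standard urllib.parse.urljoin; as far as I can tell it lower-cases the scheme while resolving but leaves the host alone, so treat the printed result as something to check on your own installation rather than as a guarantee.

```python
from urllib.parse import urljoin

# The oddly-cased base and the relative reference from the example above.
base = "htTp://EXAMPLE.org/foo/bar"
written = "htTp://EXAMPLE.org/baz"  # what simple concatenation would suggest

resolved = urljoin(base, "../baz")
print(resolved)

# Simple string comparison, as the RDF specs require for RDF URI references.
print(resolved == written)

# Case-insensitive comparison, only to show the two differ in casing alone.
print(resolved.lower() == written.lower())
```

If the first comparison prints False, the library has silently taken a step up the normalization ladder during resolution, which is exactly the situation where a consumer ends up with a URI the author never wrote.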
Re: URI Comparisons: RFC 2616 vs. RDF
On Wed, Jan 19, 2011 at 11:11 AM, Kingsley Idehen kide...@openlinksw.com wrote: On 1/19/11 10:59 AM, Nathan wrote: htTp://lists.W3.org/Archives/Public/public-lod/2011Jan/ - Personally I'd hope that any statements made using these URIs (asserted by man or machine) would remain valid regardless of the (incorrect?-)casing. Okay for Data Source Address Ref. (URL), no good for Entity (Data Item or Data Object) Name Ref., bar system specific handling via IFP property or owl:sameAs :-) Kingsley, same for you as Nathan. To what specification do you refer for the definitions and behavior of: - Data source address ref - Entity - Statement. -Alan -- Regards, Kingsley Idehen President CEO OpenLink Software Web: http://www.openlinksw.com Weblog: http://www.openlinksw.com/blog/~kidehen Twitter/Identi.ca: kidehen
Re: URI Comparisons: RFC 2616 vs. RDF
[for some reason my client isn't quoting previous mail properly, so my comments are prefixed with [AR]] On Wed, Jan 19, 2011 at 4:45 PM, Nathan nat...@webr3.org wrote: David Wood wrote: On Jan 19, 2011, at 10:59, Nathan wrote: ps: as an illustration of how engrained URI normalization is, I've capitalized the domain names in the to: and cc: fields, I do hope the mail still come through, and hope that you'll accept this email as being sent to you. Hopefully we'll also find this mail in the archives shortly at htTp://lists.W3.org/Archives/Public/public-lod/2011Jan/ - Personally I'd hope that any statements made using these URIs (asserted by man or machine) would remain valid regardless of the (incorrect?-)casing. Heh. OK, I'll bite. Domain names in email addressing are defined in IETF RFC 2822 (and its predecessor RFC 822), which defers the interpretation to RFC 1035 (Domain names - implementation and specification). RFC 1035 section 2.3.3 states that domain names in DNS, and therefore in (E)SMTP, are to be compared in a case-insensitive manner. As far as I know, the W3C specs do not so refer to RFC 1035. And I'll bite in the other direction, why not treat URIs as URIs? why go against both the RDF Specification [1] and the URI specification when they say /not/ to encode permitted US-ASCII characters (like ~ %7E)? why force case-sensitive matching on the scheme and domain on URIs matching the generic syntax when the specs say must be compared case insensitively? and so on and so forth. [AR] Which specs? (or is it singular spec) I just had a look at the XML namespace spec, for instance, which partially governs the RDF/XML serialization specification. http://www.w3.org/TR/REC-xml-names/#NSNameComparison URI references identifying namespaces are compared when determining whether a name belongs to a given namespace, and whether two names belong to the same namespace. 
[Definition: The two URIs are treated as strings, and they are *identical* if and only if the strings are identical, that is, if they are the same sequence of characters. ] The comparison is case-sensitive, and no %-escaping is done or undone. A consequence of this is that URI references which are not identical in this sense may resolve to the same resource. Examples include URI references which differ only in case or %-escaping, or which are in external entities which have different base URIs (but note that relative URIs are deprecated as namespace names). In a namespace declaration, the URI reference is the normalized value (http://www.w3.org/TR/REC-xml/#AVNormalize) of the attribute, so replacement of XML character and entity references has already been done before any comparison. Examples: The URI references below are all different for the purposes of identifying namespaces, since they differ in case: http://www.example.org/wine http://www.Example.org/wine http://www.example.org/Wine The URI references below are also all different for the purposes of identifying namespaces: http://www.example.org/~wilbur http://www.example.org/%7ewilbur http://www.example.org/%7Ewilbur So here is another spec that *explicitly* disagrees with the idea that URI normalization should be a built-in processing. I have to be honest, I can't see what good this is doing anybody, in fact it's the complete opposite scenario, where data is being junked and scrapped because we are ignoring the specifications which are designed to enable interoperability and limit unexpected behaviour. [AR] More to document, please: Which data is being junked and scrapped? I'm currently preparing a list of errors I'm finding in RDF, RDFa and linked data tooling to do with this, and I have to admit even I'm surprised at the sheer number of tools which are affected. 
Additionally there's a very nasty, and common, use case which I can't test fully, so would appreciate people taking the time to check their own libraries/clients, as follows: [AR] Hmm. Are you suggesting that the behavior of libraries and clients should have precedence over specification? My view is that one first looks to specifications, and then only if specifications are poor or do not speak to the issue do we look at existing behavior. If you find some data with the following setup (example): @base htTp://EXAMPLE.org/foo/bar . #t x:rel ../baz . and then you follow your nose to htTp://EXAMPLE.org/baz, will you find any triples about it? (problem 1) and if there's no base on the second resource, and it uses relative URIs, then the base you'll be using is htTp://EXAMPLE.org/baz, and thus, you'll effectively create a new set of statements which the author never wrote, or intended (problem 2). In other words, in this scenario, no matter what you do you're either going to get no data (even though it's there) or get a set of statements which were never said by the author (because the casing is different). [AR] I think there are many ways to lose in this scenario. For
Re: URI Comparisons: RFC 2616 vs. RDF
On Mon, 2011-01-17 at 18:16 +, Nathan wrote: Dave Reynolds wrote: On Mon, 2011-01-17 at 16:52 +, Nathan wrote: I'd suggest that it's a little more complex than that, and that this may be an issue to clear up in the next RDF WG (it's on the charter I believe). I beg to differ. The charter does state: Clarify the usage of IRI references for RDF resources, e.g., per SPARQL Query §1.2.4. However, I was under the impression that was simply removing the small difference between RDF URI References and the IRI spec (that they had anticipated). Specifically I thought the only substantive issue there was the treatment of space and many RDF processors already take the conservation position on that anyway. Likewise, apologies as I should have picked my choice of words more appropriately, I intended to say that the usage of IRI references was up for clarification, and if normalization were deemed an issue then the RDF WG may be the place to raise such an issue, and address if needed. OK, that makes sense. As for RIF and GRDDL, can anybody point me to the reasons why normalization are not performed, does this have xmlns heritage? Not as far as I know. At least in RIF we were just trying to be compatible with the RDF specs which (cwm not withstanding) do not specify normalization other than the IRI-compatible character encoding. Dave
Re: URI Comparisons: RFC 2616 vs. RDF
On Jan 17, 2011, at 13:16, Nathan wrote: Dave Reynolds wrote: On Mon, 2011-01-17 at 16:52 +, Nathan wrote: I'd suggest that it's a little more complex than that, and that this may be an issue to clear up in the next RDF WG (it's on the charter I believe). I beg to differ. The charter does state: Clarify the usage of IRI references for RDF resources, e.g., per SPARQL Query §1.2.4. However, I was under the impression that was simply removing the small difference between RDF URI References and the IRI spec (that they had anticipated). Specifically I thought the only substantive issue there was the treatment of space and many RDF processors already take the conservative position on that anyway. Likewise, apologies as I should have picked my choice of words more appropriately, I intended to say that the usage of IRI references was up for clarification, and if normalization were deemed an issue then the RDF WG may be the place to raise such an issue, and address if needed. I agree with that. The treatment of spaces is an example in the charter, not a constraint. Clarification may also occur in the updated RDF Primer if the community deems it necessary. Regards, Dave As for RIF and GRDDL, can anybody point me to the reasons why normalization is not performed; does this have xmlns heritage? Best, Nathan
Re: URI Comparisons: RFC 2616 vs. RDF
Hi Martin, I have URIs where case is important only at the terminal identifier. (HTML URIs in this example) http://lod.taxonconcept.org/ses/v6n7p.html should be different than http://lod.taxonconcept.org/ses/v6N7p.html Am I correct in thinking that this is OK? I went with this structure so I could have short bit.ly like identifiers for potentially millions of species. Thanks, - Pete On Mon, Jan 17, 2011 at 9:51 AM, Martin Hepp martin.h...@ebusiness-unibw.org wrote: Dear all: RFC 2616 [1, section 3.2.3] says that When comparing two URIs to decide if they match or not, a client SHOULD use a case-sensitive octet-by-octet comparison of the entire URIs, with these exceptions: - A port that is empty or not given is equivalent to the default port for that URI-reference; - Comparisons of host names MUST be case-insensitive; - Comparisons of scheme names MUST be case-insensitive; - An empty abs_path is equivalent to an abs_path of /. Characters other than those in the reserved and unsafe sets (see RFC 2396 [42]) are equivalent to their % HEX HEX encoding. For example, the following three URIs are equivalent: http://abc.com:80/~smith/home.html http://ABC.com/%7Esmith/home.html http://ABC.com:/%7esmith/home.html Does this also hold for identifying RDF resources a) in theory and b) in practice (e.g. in popular triplestores)? I did not test it yet, but I assume that not all implementations would treat http://purl.org/NET/c4dm/event.owl#Event HTTP://purl.org/NET/c4dm/event.owl#Event http://PURL.org/NET/c4dm/event.owl#Event http://purl.org:80/NET/c4dm/event.owl#Event as the same class. Any facts or opinions? 
Best Martin [1] http://www.ietf.org/rfc/rfc2616.txt martin hepp e-business web science research group universitaet der bundeswehr muenchen e-mail: h...@ebusiness-unibw.org phone: +49-(0)89-6004-4217 fax: +49-(0)89-6004-4620 www: http://www.unibw.de/ebusiness/ (group) http://www.heppnetz.de/ (personal) skype: mfhepp twitter: mfhepp -- --- Pete DeVries Department of Entomology University of Wisconsin - Madison 445 Russell Laboratories 1630 Linden Drive Madison, WI 53706 TaxonConcept Knowledge Base http://www.taxonconcept.org/ / GeoSpecies Knowledge Base http://lod.geospecies.org/ About the GeoSpecies Knowledge Base http://about.geospecies.org/
Re: URI Comparisons: RFC 2616 vs. RDF
On Tue, Jan 18, 2011 at 3:47 AM, Dave Reynolds dave.e.reyno...@gmail.com wrote: As for RIF and GRDDL, can anybody point me to the reasons why normalization is not performed; does this have xmlns heritage? Not as far as I know. At least in RIF we were just trying to be compatible with the RDF specs which (cwm notwithstanding) do not specify normalization other than the IRI-compatible character encoding. Similarly OWL. OWL says, following the sense of the anticipation of the IRI spec: Two IRIs are structurally equivalent if and only if their string representations are identical. As far as I can tell, you (Dave) are the only person in this conversation who cites the specification relevant to answering the question posed. That specification makes clear, as you have cited, exactly how RDF interpreters are to compare URI references. The information on how to fully determine equivalence according to the URI spec is distributed across a wide and growing number of different specifications (because it is scheme dependent) and could, in principle, change over time. Because of the distributed nature of the information it is not feasible to fully implement these rules. Optionally implementing these rules (each implementor choosing where on the ladder they want to be) would mean that documents written in RDF (and derivative languages) would be interpreted differently by different implementations, which is an unacceptable feature of languages designed for unambiguous communication. The fact that the set of rules is growing and possibly changing would lead to a similar situation - documents that meant one thing at one time could mean different things later, which is also unacceptable, for the same reason. David (Wood) clarifies (surprisingly to me as well) that the issue of normalization could be addressed by the working group. 
I expect, however, that any proposed change would quickly be determined to be counter to the instructions given in the charter on Compatibility and Deployment Expectation, and if not, would be rejected after justified objections on this basis from reviewers outside the working group. -Alan
URI Comparisons: RFC 2616 vs. RDF
Dear all: RFC 2616 [1, section 3.2.3] says that When comparing two URIs to decide if they match or not, a client SHOULD use a case-sensitive octet-by-octet comparison of the entire URIs, with these exceptions: - A port that is empty or not given is equivalent to the default port for that URI-reference; - Comparisons of host names MUST be case-insensitive; - Comparisons of scheme names MUST be case-insensitive; - An empty abs_path is equivalent to an abs_path of /. Characters other than those in the reserved and unsafe sets (see RFC 2396 [42]) are equivalent to their % HEX HEX encoding. For example, the following three URIs are equivalent: http://abc.com:80/~smith/home.html http://ABC.com/%7Esmith/home.html http://ABC.com:/%7esmith/home.html Does this also hold for identifying RDF resources a) in theory and b) in practice (e.g. in popular triplestores)? I did not test it yet, but I assume that not all implementations would treat http://purl.org/NET/c4dm/event.owl#Event HTTP://purl.org/NET/c4dm/event.owl#Event http://PURL.org/NET/c4dm/event.owl#Event http://purl.org:80/NET/c4dm/event.owl#Event as the same class. Any facts or opinions? Best Martin [1] http://www.ietf.org/rfc/rfc2616.txt martin hepp e-business web science research group universitaet der bundeswehr muenchen e-mail: h...@ebusiness-unibw.org phone: +49-(0)89-6004-4217 fax: +49-(0)89-6004-4620 www: http://www.unibw.de/ebusiness/ (group) http://www.heppnetz.de/ (personal) skype: mfhepp twitter: mfhepp
Re: URI Comparisons: RFC 2616 vs. RDF
On 1/17/11 10:51 AM, Martin Hepp wrote: Dear all: RFC 2616 [1, section 3.2.3] says that When comparing two URIs to decide if they match or not, a client SHOULD use a case-sensitive octet-by-octet comparison of the entire URIs, with these exceptions: - A port that is empty or not given is equivalent to the default port for that URI-reference; - Comparisons of host names MUST be case-insensitive; - Comparisons of scheme names MUST be case-insensitive; - An empty abs_path is equivalent to an abs_path of /. Characters other than those in the reserved and unsafe sets (see RFC 2396 [42]) are equivalent to their % HEX HEX encoding. For example, the following three URIs are equivalent: http://abc.com:80/~smith/home.html http://ABC.com/%7Esmith/home.html http://ABC.com:/%7esmith/home.html Does this also hold for identifying RDF resources Yes, where an RDF resource is a Data Container at an Address (URL). Thus, equivalent results for de-referencing a URL en route to accessing data. No, when resource also implies an Entity (Data Item or Data Object) that is assigned a Name via URI. The examples above strike me as URLs. Of course, cURL could indicate otherwise, but for now (via my visual senses) they appear to be URLs (resource addresses). a) in theory and b) in practice (e.g. in popular triplestores)? I did not test it yet, but I assume that not all implementations would treat http://purl.org/NET/c4dm/event.owl#Event HTTP://purl.org/NET/c4dm/event.owl#Event http://PURL.org/NET/c4dm/event.owl#Event http://purl.org:80/NET/c4dm/event.owl#Event as the same class. Any facts or opinions? See my comments above. 
Kingsley Best Martin [1] http://www.ietf.org/rfc/rfc2616.txt martin hepp e-business web science research group universitaet der bundeswehr muenchen e-mail: h...@ebusiness-unibw.org phone: +49-(0)89-6004-4217 fax: +49-(0)89-6004-4620 www: http://www.unibw.de/ebusiness/ (group) http://www.heppnetz.de/ (personal) skype: mfhepp twitter: mfhepp -- Regards, Kingsley Idehen President CEO OpenLink Software Web: http://www.openlinksw.com Weblog: http://www.openlinksw.com/blog/~kidehen Twitter/Identi.ca: kidehen
Re: URI Comparisons: RFC 2616 vs. RDF
On Mon, 2011-01-17 at 16:51 +0100, Martin Hepp wrote: Dear all: RFC 2616 [1, section 3.2.3] says that When comparing two URIs to decide if they match or not, a client SHOULD use a case-sensitive octet-by-octet comparison of the entire URIs, with these exceptions: - A port that is empty or not given is equivalent to the default port for that URI-reference; - Comparisons of host names MUST be case-insensitive; - Comparisons of scheme names MUST be case-insensitive; - An empty abs_path is equivalent to an abs_path of /. Characters other than those in the reserved and unsafe sets (see RFC 2396 [42]) are equivalent to their % HEX HEX encoding. For example, the following three URIs are equivalent: http://abc.com:80/~smith/home.html http://ABC.com/%7Esmith/home.html http://ABC.com:/%7esmith/home.html Does this also hold for identifying RDF resources a) in theory and No. RDF Concepts defines equality of RDF URI References [1] as simply character-by-character equality of the %-encoded UTF-8 Unicode strings. Note the final Note in that section: Note: Because of the risk of confusion between RDF URI references that would be equivalent if derefenced, the use of %-escaped characters in RDF URI references is strongly discouraged. which explicitly calls out the difference between URI equivalence (dereference to the same resource) and RDF URI Reference equality. BTW the more up to date RFC for looking at equivalence (as opposed to equality) issues is probably the IRI spec [2] which defines a comparison ladder for testing equivalence. Dave [1] http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/#section-Graph-URIref [2] http://www.ietf.org/rfc/rfc3987.txt
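To make the distinction Dave draws concrete, here is a minimal Python sketch of equality as RDF Concepts describes it: plain character-by-character comparison of the strings. The function name `rdf_uriref_equal` is an assumption made for illustration and does not come from any RDF library; the URIs are the ones Martin posted.

```python
def rdf_uriref_equal(a: str, b: str) -> bool:
    """Compare two RDF URI references character by character.

    No case folding, no default-port handling and no percent-decoding
    is applied, so URIs that dereference to the same place can still
    be different RDF nodes.
    """
    return a == b


# Martin's four spellings of the same class URI.
event_uris = [
    "http://purl.org/NET/c4dm/event.owl#Event",
    "HTTP://purl.org/NET/c4dm/event.owl#Event",
    "http://PURL.org/NET/c4dm/event.owl#Event",
    "http://purl.org:80/NET/c4dm/event.owl#Event",
]

# A store that follows the string-equality rule keeps every spelling
# as a separate node.
distinct_nodes = set(event_uris)
print(len(distinct_nodes))

# The case of the hex digits in a percent-escape also matters here.
print(rdf_uriref_equal("http://ABC.com/%7Esmith/home.html",
                       "http://ABC.com/%7esmith/home.html"))
```

Under this rule the four spellings stay four separate nodes, which is the behaviour Martin suspected he would see in practice.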
Re: URI Comparisons: RFC 2616 vs. RDF
Dave Reynolds wrote: On Mon, 2011-01-17 at 16:51 +0100, Martin Hepp wrote: Dear all: RFC 2616 [1, section 3.2.3] says that When comparing two URIs to decide if they match or not, a client SHOULD use a case-sensitive octet-by-octet comparison of the entire URIs, with these exceptions: - A port that is empty or not given is equivalent to the default port for that URI-reference; - Comparisons of host names MUST be case-insensitive; - Comparisons of scheme names MUST be case-insensitive; - An empty abs_path is equivalent to an abs_path of /. Characters other than those in the reserved and unsafe sets (see RFC 2396 [42]) are equivalent to their % HEX HEX encoding. For example, the following three URIs are equivalent: http://abc.com:80/~smith/home.html http://ABC.com/%7Esmith/home.html http://ABC.com:/%7esmith/home.html Does this also hold for identifying RDF resources a) in theory and No. RDF Concepts defines equality of RDF URI References [1] as simply character-by-character equality of the %-encoded UTF-8 Unicode strings. Note the final Note in that section: Note: Because of the risk of confusion between RDF URI references that would be equivalent if derefenced, the use of %-escaped characters in RDF URI references is strongly discouraged. which explicitly calls out the difference between URI equivalence (dereference to the same resource) and RDF URI Reference equality. I'd suggest that it's a little more complex than that, and that this may be an issue to clear up in the next RDF WG (it's on the charter I believe). For example: When a URI uses components of the generic syntax, the component syntax equivalence rules always apply; namely, that the scheme and host are case-insensitive and therefore should be normalized to lowercase. For example, the URI HTTP://www.EXAMPLE.com/ is equivalent to http://www.example.com/. - http://tools.ietf.org/html/rfc3986#section-6.2.2.1 However, that's only for URIs which use the generic syntax (which most URIs we ever touch do use). 
It would be great if a normalized-IRI with specific normalization rules could be drafted up as part of the next WG charter - after all they are a pretty pivotal part of the sem web setup, and it would be relatively easy to clear up these issues. Best, Nathan
Re: URI Comparisons: RFC 2616 vs. RDF
On 1/17/11 11:37 AM, Dave Reynolds wrote: On Mon, 2011-01-17 at 16:51 +0100, Martin Hepp wrote: Dear all: RFC 2616 [1, section 3.2.3] says that When comparing two URIs to decide if they match or not, a client SHOULD use a case-sensitive octet-by-octet comparison of the entire URIs, with these exceptions: - A port that is empty or not given is equivalent to the default port for that URI-reference; - Comparisons of host names MUST be case-insensitive; - Comparisons of scheme names MUST be case-insensitive; - An empty abs_path is equivalent to an abs_path of /. Characters other than those in the reserved and unsafe sets (see RFC 2396 [42]) are equivalent to their % HEX HEX encoding. For example, the following three URIs are equivalent: http://abc.com:80/~smith/home.html http://ABC.com/%7Esmith/home.html http://ABC.com:/%7esmith/home.html Does this also hold for identifying RDF resources a) in theory and No. RDF Concepts defines equality of RDF URI References [1] as simply character-by-character equality of the %-encoded UTF-8 Unicode strings. Note the final Note in that section: Note: Because of the risk of confusion between RDF URI references that would be equivalent if derefenced, the use of %-escaped characters in RDF URI references is strongly discouraged. which explicitly calls out the difference between URI equivalence (dereference to the same resource) and RDF URI Reference equality. BTW the more up to date RFC for looking at equivalence (as opposed to equality) issues is probably the IRI spec [2] which defines a comparison ladder for testing equivalence. Dave [1] http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/#section-Graph-URIref [2] http://www.ietf.org/rfc/rfc3987.txt Dave, Important RFC excerpt: A mapping from IRIs to URIs is defined, which means that IRIs can be used instead of URIs, where appropriate, to identify resources. . 
The context for resources is not equivalent or identical to the notion of an Identifier used as a Data Object (Item or Entity) Name. This context is all about good old machine addressable resources. In Linked Data context (aka. Distributed Data Object context) an Identifier used as a Name Reference can de-reference to a resource that bears (or carries) a Representation of its Description (a graph pictorial where Attribute=Value pairs coalesce around a Name Reference). Names are Names, if they are Unique, they should be Unique. Of course, not so when dealing with Addresses of data, which is what the RFC context applies to as I understand it. Until we clarify Resource confusion will reign. -- Regards, Kingsley Idehen President CEO OpenLink Software Web: http://www.openlinksw.com Weblog: http://www.openlinksw.com/blog/~kidehen Twitter/Identi.ca: kidehen
Re: URI Comparisons: RFC 2616 vs. RDF
Kingsley Idehen wrote: On 1/17/11 10:51 AM, Martin Hepp wrote: Dear all: RFC 2616 [1, section 3.2.3] says that When comparing two URIs to decide if they match or not, a client SHOULD use a case-sensitive octet-by-octet comparison of the entire URIs, with these exceptions: - A port that is empty or not given is equivalent to the default port for that URI-reference; - Comparisons of host names MUST be case-insensitive; - Comparisons of scheme names MUST be case-insensitive; - An empty abs_path is equivalent to an abs_path of /. Characters other than those in the reserved and unsafe sets (see RFC 2396 [42]) are equivalent to their % HEX HEX encoding. For example, the following three URIs are equivalent: http://abc.com:80/~smith/home.html http://ABC.com/%7Esmith/home.html http://ABC.com:/%7esmith/home.html Does this also hold for identifying RDF resources Yes, where an RDF resource is a Data Container at an Address (URL). Thus, equivalent results for de-referencing a URL en route to accessing data. No, when resource also implies an Entity (Data Item or Data Object) that is assigned a Name via URI. Logically, yes on both counts, we should/could be normalizing these URIs as we consume and publish using the syntax based normalization rules [1] which apply to all URI/IRIs with the generic syntax (such as the examples above) Any client consuming data, or server publishing data, can use the normalization rules, so it stands to reason that it's pretty important that we all do it to avoid false negatives. [1] http://tools.ietf.org/html/rfc3986#section-6.2.2 Best, Nathan
Re: URI Comparisons: RFC 2616 vs. RDF
Hi, I am particularly interested in this issue, because I am currently struggling with such a problem within the Sindice project. Given also the answer of Dave, what would be the best practices within an (RDF) system to correctly handle URIs? Should the system implement URI normalisation based on the RFC 2616 exceptions: - A port that is empty or not given is equivalent to the default port for that URI-reference; - Comparisons of host names MUST be case-insensitive; - Comparisons of scheme names MUST be case-insensitive; - An empty abs_path is equivalent to an abs_path of /. and should it take care of decoding all percent-encoded characters? However, when dealing with percent-encoded characters, some cases become tricky to handle. For example, some URIs [1] have a space encoded at the end of the string. By decoding it, certain systems/applications could automatically trim it. Also, some URIs [2] are 'recursively' encoded, and need multiple decoding passes before getting the right one. [1] http://geo.linkeddata.es/resource/Pozo/Moro%2C%20Pou%2047%20o%20del%20 [2] http://sioc-project.org/sioc/user/1%2523user Any opinions on how to correctly handle URIs are welcome. It would be useful to have a document of best practices for correctly handling URIs in an RDF system. Best, -- Renaud Delbru On 17/01/11 15:51, Martin Hepp wrote: Dear all: RFC 2616 [1, section 3.2.3] says that When comparing two URIs to decide if they match or not, a client SHOULD use a case-sensitive octet-by-octet comparison of the entire URIs, with these exceptions: - A port that is empty or not given is equivalent to the default port for that URI-reference; - Comparisons of host names MUST be case-insensitive; - Comparisons of scheme names MUST be case-insensitive; - An empty abs_path is equivalent to an abs_path of /. Characters other than those in the reserved and unsafe sets (see RFC 2396 [42]) are equivalent to their % HEX HEX encoding. 
For example, the following three URIs are equivalent: http://abc.com:80/~smith/home.html http://ABC.com/%7Esmith/home.html http://ABC.com:/%7esmith/home.html Does this also hold for identifying RDF resources a) in theory and b) in practice (e.g. in popular triplestores)? I did not test it yet, but I assume that not all implementations would treat http://purl.org/NET/c4dm/event.owl#Event HTTP://purl.org/NET/c4dm/event.owl#Event http://PURL.org/NET/c4dm/event.owl#Event http://purl.org:80/NET/c4dm/event.owl#Event as the same class. Any facts or opinions? Best Martin [1] http://www.ietf.org/rfc/rfc2616.txt martin hepp e-business web science research group universitaet der bundeswehr muenchen e-mail: h...@ebusiness-unibw.org phone: +49-(0)89-6004-4217 fax: +49-(0)89-6004-4620 www: http://www.unibw.de/ebusiness/ (group) http://www.heppnetz.de/ (personal) skype: mfhepp twitter: mfhepp
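The two tricky cases Renaud lists can be reproduced with a short Python sketch. It only uses `urllib.parse` from the standard library and is an illustration of the pitfalls, not a recommendation on how Sindice or any other system should decode; the variable names are assumptions.

```python
from urllib.parse import unquote, urlsplit

# Case [1]: an encoded space at the very end of the URI.
trailing_space = (
    "http://geo.linkeddata.es/resource/Pozo/Moro%2C%20Pou%2047%20o%20del%20"
)
decoded = unquote(trailing_space)
# The decoded string ends in a literal space, so any later strip()/trim()
# silently yields a different identifier.
trimmed = decoded.strip()

# Case [2]: a 'recursively' encoded URI.
double_encoded = "http://sioc-project.org/sioc/user/1%2523user"
once = unquote(double_encoded)   # "%25" becomes "%", leaving "%23"
twice = unquote(once)            # "%23" becomes "#"

# After the second pass the "#" is a fragment delimiter, so the
# structure of the URI has changed, not just its spelling.
print(repr(decoded[-5:]), decoded == trimmed)
print(once, urlsplit(once).fragment == "")
print(twice, urlsplit(twice).fragment)
```

Both cases suggest that blanket decoding is lossy: decoding reserved characters such as space or `#` can change which resource a URI names, which is why RFC 3986 limits safe decoding to the unreserved set.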
Re: URI Comparisons: RFC 2616 vs. RDF
Better be a bit more specific.. in-line.. Nathan wrote: Kingsley Idehen wrote: On 1/17/11 10:51 AM, Martin Hepp wrote: Dear all: RFC 2616 [1, section 3.2.3] says that When comparing two URIs to decide if they match or not, a client SHOULD use a case-sensitive octet-by-octet comparison of the entire URIs, with these exceptions: - A port that is empty or not given is equivalent to the default port for that URI-reference; - Comparisons of host names MUST be case-insensitive; - Comparisons of scheme names MUST be case-insensitive; - An empty abs_path is equivalent to an abs_path of /. Characters other than those in the reserved and unsafe sets (see RFC 2396 [42]) are equivalent to their % HEX HEX encoding. For example, the following three URIs are equivalent: http://abc.com:80/~smith/home.html http://ABC.com/%7Esmith/home.html http://ABC.com:/%7esmith/home.html As per the percent encoding rules and the set of unreserved characters [1], percent encoded octets in certain ranges (see [1]) should not be created by URI producers, and when found in a URI should be decoded correctly, this includes %7E - also percent encoding is case insensitive so %7e and %7E are equivalent, thus you should not produce URIs like this, and when found you should fix the error, to produce: http://abc.com:80/~smith/home.html http://ABC.com/~smith/home.html http://ABC.com:/~smith/home.html The above URIs all use the generic syntax, so the generic component syntax equivalence rules always apply [2], so normalization after these rules would produce: http://abc.com:80/~smith/home.html http://abc.com/~smith/home.html http://abc.com:/~smith/home.html Then finally, scheme specific normalization rules can be applied which treat all the port values as being equivalent (for the purpose of naming and dereferencing, it's the specification for URIs with that scheme), which allows you to normalize to: http://abc.com/~smith/home.html http://abc.com/~smith/home.html http://abc.com/~smith/home.html [1] 
http://tools.ietf.org/html/rfc3986#section-2.3 [2] http://tools.ietf.org/html/rfc3986#section-6.2.2.1 [3] http://tools.ietf.org/html/rfc3986#section-6.2.3 Hope that helps refine my previous comments, Does this also hold for identifying RDF resources Yes, where an RDF resource is a Data Container at an Address (URL). Thus, equivalent results for de-referencing a URL en route to accessing data. No, when resource also implies an Entity (Data Item or Data Object) that is assigned a Name via URI. Logically, yes on both counts, we should/could be normalizing these URIs as we consume and publish using the syntax based normalization rules [1] which apply to all URI/IRIs with the generic syntax (such as the examples above) Any client consuming data, or server publishing data, can use the normalization rules, so it stands to reason that it's pretty important that we all do it to avoid false negatives. [1] http://tools.ietf.org/html/rfc3986#section-6.2.2 Best, Nathan
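The three stages Nathan walks through (decode unreserved percent-escapes, lowercase scheme and host, drop the default port) can be sketched in Python as follows. This is an illustrative sketch built on `urllib.parse`, not a complete RFC 3986 normalizer: the function names and the `DEFAULT_PORTS` table are assumptions, and dot-segment removal and IDN handling are left out.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# RFC 3986 section 2.3: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
PCT_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")
DEFAULT_PORTS = {"http": "80", "https": "443"}  # assumed scheme table


def decode_unreserved(uri: str) -> str:
    """Stage 1: decode escapes of unreserved characters, uppercase the rest."""
    def fix(match):
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else "%" + match.group(1).upper()
    return PCT_ESCAPE.sub(fix, uri)


def lowercase_scheme_and_host(uri: str) -> str:
    """Stage 2: lowercase scheme and host; userinfo and path keep their case."""
    parts = urlsplit(uri)  # urlsplit already lowercases the scheme
    userinfo, at, hostport = parts.netloc.rpartition("@")
    netloc = userinfo + at + hostport.lower()
    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )


def drop_default_port(uri: str) -> str:
    """Stage 3: for known schemes, drop an empty or default port and use
    "/" for an empty path."""
    parts = urlsplit(uri)
    if parts.scheme not in DEFAULT_PORTS or not parts.netloc:
        return uri
    userinfo, at, hostport = parts.netloc.rpartition("@")
    # A trailing "]" means a bare IPv6 literal with no port.
    if ":" in hostport and not hostport.endswith("]"):
        host, _, port = hostport.rpartition(":")
        if port in ("", DEFAULT_PORTS[parts.scheme]):
            hostport = host
    return urlunsplit(
        (parts.scheme, userinfo + at + hostport, parts.path or "/",
         parts.query, parts.fragment)
    )


def normalize(uri: str) -> str:
    return drop_default_port(lowercase_scheme_and_host(decode_unreserved(uri)))


for uri in ("http://abc.com:80/~smith/home.html",
            "http://ABC.com/%7Esmith/home.html",
            "http://ABC.com:/%7esmith/home.html"):
    print(normalize(uri))
```

Keeping the stages as separate functions mirrors the order in the message and lets a consumer stop after the syntax-based stages if it does not want scheme-specific rules applied.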
Re: URI Comparisons: RFC 2616 vs. RDF
Nuno Bettencourt wrote: Hi, Even though I'll be deviating the point just a bit, since we're discussing URI comparison in terms of RDF, I would like to request some help. I have a doubt about URLs when it comes to RDF URI comparison. Is there any RFC that establishes if http://abc.com:80/~smith/home.html https://abc.com:80/~smith/home.html or even ftp://abc.com:80/~smith/home.html should or not be considered the same resource? No, and no such rules can be written (as they are case specific, and all the above URIs could easily, and often do, point to differing resources) - if all URIs point to the same resource then it should be stated as such by some other means, which in RDF would mean owl:sameAs. Best, Nathan
RE: URI Comparisons: RFC 2616 vs. RDF
Hi, Even though I'll be deviating the point just a bit, since we're discussing URI comparison in terms of RDF, I would like to request some help. I have a doubt about URLs when it comes to RDF URI comparison. Is there any RFC that establishes if http://abc.com:80/~smith/home.html https://abc.com:80/~smith/home.html or even ftp://abc.com:80/~smith/home.html should or not be considered the same resource? Best regards, Nuno Bettencourt -Original Message- From: public-lod-requ...@w3.org [mailto:public-lod-requ...@w3.org] On Behalf Of Nathan Sent: segunda-feira, 17 de Janeiro de 2011 16:53 To: Dave Reynolds; Sandro Hawke Cc: Martin Hepp; public-lod@w3.org Subject: Re: URI Comparisons: RFC 2616 vs. RDF Dave Reynolds wrote: On Mon, 2011-01-17 at 16:51 +0100, Martin Hepp wrote: Dear all: RFC 2616 [1, section 3.2.3] says that When comparing two URIs to decide if they match or not, a client SHOULD use a case-sensitive octet-by-octet comparison of the entire URIs, with these exceptions: - A port that is empty or not given is equivalent to the default port for that URI-reference; - Comparisons of host names MUST be case-insensitive; - Comparisons of scheme names MUST be case-insensitive; - An empty abs_path is equivalent to an abs_path of /. Characters other than those in the reserved and unsafe sets (see RFC 2396 [42]) are equivalent to their % HEX HEX encoding. For example, the following three URIs are equivalent: http://abc.com:80/~smith/home.html http://ABC.com/%7Esmith/home.html http://ABC.com:/%7esmith/home.html Does this also hold for identifying RDF resources a) in theory and No. RDF Concepts defines equality of RDF URI References [1] as simply character-by-character equality of the %-encoded UTF-8 Unicode strings. Note the final Note in that section: Note: Because of the risk of confusion between RDF URI references that would be equivalent if derefenced, the use of %-escaped characters in RDF URI references is strongly discouraged. 
which explicitly calls out the difference between URI equivalence (dereference to the same resource) and RDF URI Reference equality. I'd suggest that it's a little more complex than that, and that this may be an issue to clear up in the next RDF WG (it's on the charter I believe). For example: When a URI uses components of the generic syntax, the component syntax equivalence rules always apply; namely, that the scheme and host are case-insensitive and therefore should be normalized to lowercase. For example, the URI HTTP://www.EXAMPLE.com/ is equivalent to http://www.example.com/. - http://tools.ietf.org/html/rfc3986#section-6.2.2.1 However, that's only for URIs which use the generic syntax (which most URIs we ever touch do use). It would be great if a normalized-IRI with specific normalization rules could be drafted up as part of the next WG charter - after all they are a pretty pivotal part of the sem web setup, and it would be relatively easy to clear up these issues. Best, Nathan
Re: URI Comparisons: RFC 2616 vs. RDF
In the short term, it sounds like there's a gap in the code-ecosystem for a really lightweight tool which took a stream of N-Triples and just output a normalised stream of N-Triples ready for import. The examples below would make a good initial test set for it. I'd write it if I didn't have a bunch of code-bunnies biting my ankles and demanding to be created. As for triple stores; I know that the number of triples-per-second on import can be important, so if you already know your data is clean you'd want to at least make normalise-on-input optional to improve performance. On 17/01/11 16:57, Nathan wrote: Kingsley Idehen wrote: On 1/17/11 10:51 AM, Martin Hepp wrote: Dear all: RFC 2616 [1, section 3.2.3] says that When comparing two URIs to decide if they match or not, a client SHOULD use a case-sensitive octet-by-octet comparison of the entire URIs, with these exceptions: - A port that is empty or not given is equivalent to the default port for that URI-reference; - Comparisons of host names MUST be case-insensitive; - Comparisons of scheme names MUST be case-insensitive; - An empty abs_path is equivalent to an abs_path of /. Characters other than those in the reserved and unsafe sets (see RFC 2396 [42]) are equivalent to their % HEX HEX encoding. For example, the following three URIs are equivalent: http://abc.com:80/~smith/home.html http://ABC.com/%7Esmith/home.html http://ABC.com:/%7esmith/home.html Does this also hold for identifying RDF resources Yes, where an RDF resource is a Data Container at an Address (URL). Thus, equivalent results for de-referencing a URL en route to accessing data. No, when resource also implies an Entity (Data Item or Data Object) that is assigned a Name via URI. 
Logically, yes on both counts, we should/could be normalizing these URIs as we consume and publish using the syntax based normalization rules [1] which apply to all URI/IRIs with the generic syntax (such as the examples above) Any client consuming data, or server publishing data, can use the normalization rules, so it stands to reason that it's pretty important that we all do it to avoid false negatives. [1] http://tools.ietf.org/html/rfc3986#section-6.2.2 Best, Nathan -- Christopher Gutteridge -- http://id.ecs.soton.ac.uk/person/1248 / Lead Developer, EPrints Project, http://eprints.org/ / Web Projects Manager, ECS, University of Southampton, http://www.ecs.soton.ac.uk/ / Webmaster, Web Science Trust, http://www.webscience.org/
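A rough idea of how small the N-Triples filter Christopher describes could be, sketched in Python. Everything here is an assumption made for illustration: the names `normalise_iri`, `normalise_line` and `normalise_stream`, the decision to touch only http and https IRIs, and the decision to limit the rewrite to host case, default ports and empty paths (no percent-decoding). The regular expression is a simplification of the N-Triples grammar and does not treat comments specially.

```python
import re
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": "80", "https": "443"}  # assumed scheme table

# Match either a quoted literal (left untouched) or an <IRI> term.
TOKEN = re.compile(r'"(?:[^"\\]|\\.)*"|<([^<>\s]*)>')


def normalise_iri(iri: str) -> str:
    """Lowercase scheme and host, drop an empty or default port."""
    try:
        parts = urlsplit(iri)  # the scheme comes back lowercased
    except ValueError:
        return iri  # leave anything urlsplit rejects alone
    if parts.scheme not in DEFAULT_PORTS or not parts.netloc:
        return iri
    userinfo, at, hostport = parts.netloc.rpartition("@")
    hostport = hostport.lower()
    if ":" in hostport and not hostport.endswith("]"):
        host, _, port = hostport.rpartition(":")
        if port in ("", DEFAULT_PORTS[parts.scheme]):
            hostport = host
    return urlunsplit(
        (parts.scheme, userinfo + at + hostport, parts.path or "/",
         parts.query, parts.fragment)
    )


def normalise_line(line: str) -> str:
    """Rewrite every IRI term in one N-Triples line; literals pass through."""
    def fix(match):
        if match.group(1) is None:  # a quoted literal
            return match.group(0)
        return "<" + normalise_iri(match.group(1)) + ">"
    return TOKEN.sub(fix, line)


def normalise_stream(lines):
    """Lazily normalise an iterable of N-Triples lines."""
    for line in lines:
        yield normalise_line(line)
```

Used as a filter it would be driven with something like `sys.stdout.writelines(normalise_stream(sys.stdin))`. Because it is a separate pass over the stream, a store that already trusts its input can simply skip it, which keeps normalise-on-input optional as suggested.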
RE: URI Comparisons: RFC 2616 vs. RDF
Hi, The doubt just kept on because in all protocols we were still referring to the same URN. Thank you for your explanation, and we've been using the owl:sameAs property for this. Nuno Bettencourt -Original Message- From: Nathan [mailto:nat...@webr3.org] Sent: segunda-feira, 17 de Janeiro de 2011 17:34 To: Nuno Bettencourt Cc: 'Dave Reynolds'; 'Martin Hepp'; public-lod@w3.org Subject: Re: URI Comparisons: RFC 2616 vs. RDF Nuno Bettencourt wrote: Hi, Even though I'll be deviating the point just a bit, since we're discussing URI comparison in terms of RDF, I would like to request some help. I have a doubt about URLs when it comes to RDF URI comparison. Is there any RFC that establishes if http://abc.com:80/~smith/home.html https://abc.com:80/~smith/home.html or even ftp://abc.com:80/~smith/home.html should or not be considered the same resource? No, and no such rules can be written (as they are case specific, and all the above URIs could easily, and often do, point to differing resources) - if all URIs point to the same resource then it should be stated as such by some other means, which in RDF would mean owl:sameAs. Best, Nathan
Re: URI Comparisons: RFC 2616 vs. RDF
Nuno Bettencourt wrote: Hi, The doubt just kept on because in all protocols we were still referring to the same URN. do you mean that there were RDF statements which linked each of the protocol specific URIs to a single URN via the same property? eg: http://... x:foo urn:here https://... x:foo urn:here ftp://... x:foo urn:here If so, then you could define the property (x:foo above) as an Inverse Functional Property which would take care of the sameness for you. Best, Nathan
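What declaring x:foo an Inverse Functional Property buys you can be sketched in plain Python, without any RDF library. The helper name `same_as_groups` is hypothetical, the subject URIs are the ones from Nuno's question, and the grouping is a simplification: a real reasoner would also take the transitive closure across several properties.

```python
from collections import defaultdict


def same_as_groups(triples, inverse_functional_property):
    """Group subjects that share a value for an inverse functional property.

    If a property is inverse functional, two subjects with the same
    object for it must denote the same thing.
    """
    subjects_by_object = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == inverse_functional_property:
            subjects_by_object[obj].add(subject)
    return [group for group in subjects_by_object.values() if len(group) > 1]


triples = [
    ("http://abc.com:80/~smith/home.html", "x:foo", "urn:here"),
    ("https://abc.com:80/~smith/home.html", "x:foo", "urn:here"),
    ("ftp://abc.com:80/~smith/home.html", "x:foo", "urn:here"),
]

groups = same_as_groups(triples, "x:foo")
print(groups)
```

Each returned group is a set of URIs that a reasoner would treat as naming the same resource, with no explicit owl:sameAs statements needed between them.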
Re: URI Comparisons: RFC 2616 vs. RDF
On Mon, 2011-01-17 at 16:52 +, Nathan wrote: Dave Reynolds wrote: On Mon, 2011-01-17 at 16:51 +0100, Martin Hepp wrote: Dear all: RFC 2616 [1, section 3.2.3] says that When comparing two URIs to decide if they match or not, a client SHOULD use a case-sensitive octet-by-octet comparison of the entire URIs, with these exceptions: - A port that is empty or not given is equivalent to the default port for that URI-reference; - Comparisons of host names MUST be case-insensitive; - Comparisons of scheme names MUST be case-insensitive; - An empty abs_path is equivalent to an abs_path of /. Characters other than those in the reserved and unsafe sets (see RFC 2396 [42]) are equivalent to their % HEX HEX encoding. For example, the following three URIs are equivalent: http://abc.com:80/~smith/home.html http://ABC.com/%7Esmith/home.html http://ABC.com:/%7esmith/home.html Does this also hold for identifying RDF resources a) in theory and No. RDF Concepts defines equality of RDF URI References [1] as simply character-by-character equality of the %-encoded UTF-8 Unicode strings. Note the final Note in that section: Note: Because of the risk of confusion between RDF URI references that would be equivalent if derefenced, the use of %-escaped characters in RDF URI references is strongly discouraged. which explicitly calls out the difference between URI equivalence (dereference to the same resource) and RDF URI Reference equality. I'd suggest that it's a little more complex than that, and that this may be an issue to clear up in the next RDF WG (it's on the charter I believe). I beg to differ. The charter does state: Clarify the usage of IRI references for RDF resources, e.g., per SPARQL Query §1.2.4. However, I was under the impression that was simply removing the small difference between RDF URI References and the IRI spec (that they had anticipated). 
Specifically I thought the only substantive issue there was the treatment of space and many RDF processors already take the conservative position on that anyway. Replacing encoded string equality by dereference-equivalence would be a pretty big change to RDF and I hadn't realized that was being considered. Could one of the nominated chairs or a W3C rep clarify this? For example: When a URI uses components of the generic syntax, the component syntax equivalence rules always apply; namely, that the scheme and host are case-insensitive and therefore should be normalized to lowercase. For example, the URI HTTP://www.EXAMPLE.com/ is equivalent to http://www.example.com/. - http://tools.ietf.org/html/rfc3986#section-6.2.2.1 Sure but the later RDF-related specs such as GRDDL and RIF clarify the application of that in RDF. For example in RIF [1] we said: Neither Syntax-Based Normalization nor Scheme-Based Normalization (described in Sections 6.2.2 and 6.2.3 of RFC-3986) are performed. A form of words that, I think, we lifted verbatim from GRDDL which in turn had chosen them to clarify how the original RDF URI References spec should be interpreted in the light of the updated URI/IRI RFCs. Changing RDF to require syntax or scheme based normalization would require changing at least RIF and GRDDL as well. If that was really on the cards I would have expected it to have been more broadly publicized. Dave [1] http://www.w3.org/TR/2010/PR-rif-dtb-20100511/#Relative_IRIs
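The difference between the two positions in this message can be shown in a few lines. This is only a sketch: the function names are made up here, and the second function applies nothing beyond the case normalization of RFC 3986 section 6.2.2.1 (scheme and host lowercased). It is not what any RDF spec mandates; RDF Concepts asks for the plain string comparison in the first function.

```python
from urllib.parse import urlsplit, urlunsplit


def rdf_equal(a: str, b: str) -> bool:
    """RDF URI Reference equality: plain character-by-character comparison."""
    return a == b


def lowercase_scheme_and_host(uri: str) -> str:
    """Lowercase the scheme and host only; leave everything else untouched."""
    parts = urlsplit(uri)
    # Userinfo (if any) is case-sensitive, so only the host:port part is lowered.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    netloc = userinfo + at + hostport.lower()
    return urlunsplit(
        (parts.scheme.lower(), netloc, parts.path, parts.query, parts.fragment)
    )


a = "HTTP://www.EXAMPLE.com/"
b = "http://www.example.com/"

strictly_equal = rdf_equal(a, b)
equal_after_case_normalization = rdf_equal(
    lowercase_scheme_and_host(a), lowercase_scheme_and_host(b)
)
```

Under the RDF definition the two strings are different names; only a processor that chooses to normalize first would treat them as one. Paths stay case-sensitive either way, so two URIs that differ only in the case of a path segment remain distinct.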
Re: URI Comparisons: RFC 2616 vs. RDF
Dave Reynolds wrote: On Mon, 2011-01-17 at 16:52 +, Nathan wrote: I'd suggest that it's a little more complex than that, and that this may be an issue to clear up in the next RDF WG (it's on the charter I believe). I beg to differ. The charter does state: Clarify the usage of IRI references for RDF resources, e.g., per SPARQL Query §1.2.4. However, I was under the impression that was simply removing the small difference between RDF URI References and the IRI spec (that they had anticipated). Specifically I thought the only substantive issue there was the treatment of space and many RDF processors already take the conservative position on that anyway. Likewise, apologies as I should have chosen my words more carefully. I intended to say that the usage of IRI references was up for clarification, and if normalization were deemed an issue then the RDF WG may be the place to raise such an issue, and address it if needed. As for RIF and GRDDL, can anybody point me to the reasons why normalization is not performed? Does this have xmlns heritage? Best, Nathan
Re: URI Comparisons: RFC 2616 vs. RDF
On 1/17/11 12:27 PM, Nuno Bettencourt wrote: Hi, Even though I'll be deviating the point just a bit, since we're discussing URI comparison in terms of RDF, I would like to request some help. I have a doubt about URLs when it comes to RDF URI comparison. Is there any RFC that establishes if http://abc.com:80/~smith/home.html https://abc.com:80/~smith/home.html or even ftp://abc.com:80/~smith/home.html should or not be considered the same resource? All of the above are Addresses (based on what I can infer via my visual senses). The URI abstraction enables multiple scheme data access. ftp: and http: are schemes. None of them isA resource. They simply provide access to data which may be serialized in a variety of formats to a user agent that de-references any of these Addresses. Basically, network aware pointers with data representation dexterity courtesy of URI abstraction and HTTP's content negotiation. Kingsley Best regards, Nuno Bettencourt -Original Message- From: public-lod-requ...@w3.org [mailto:public-lod-requ...@w3.org] On Behalf Of Nathan Sent: segunda-feira, 17 de Janeiro de 2011 16:53 To: Dave Reynolds; Sandro Hawke Cc: Martin Hepp; public-lod@w3.org Subject: Re: URI Comparisons: RFC 2616 vs. RDF Dave Reynolds wrote: On Mon, 2011-01-17 at 16:51 +0100, Martin Hepp wrote: Dear all: RFC 2616 [1, section 3.2.3] says that When comparing two URIs to decide if they match or not, a client SHOULD use a case-sensitive octet-by-octet comparison of the entire URIs, with these exceptions: - A port that is empty or not given is equivalent to the default port for that URI-reference; - Comparisons of host names MUST be case-insensitive; - Comparisons of scheme names MUST be case-insensitive; - An empty abs_path is equivalent to an abs_path of /. Characters other than those in the reserved and unsafe sets (see RFC 2396 [42]) are equivalent to their % HEX HEX encoding. 
For example, the following three URIs are equivalent: http://abc.com:80/~smith/home.html http://ABC.com/%7Esmith/home.html http://ABC.com:/%7esmith/home.html Does this also hold for identifying RDF resources a) in theory and No. RDF Concepts defines equality of RDF URI References [1] as simply character-by-character equality of the %-encoded UTF-8 Unicode strings. Note the final Note in that section: Note: Because of the risk of confusion between RDF URI references that would be equivalent if derefenced, the use of %-escaped characters in RDF URI references is strongly discouraged. which explicitly calls out the difference between URI equivalence (dereference to the same resource) and RDF URI Reference equality. I'd suggest that it's a little more complex than that, and that this may be an issue to clear up in the next RDF WG (it's on the charter I believe). For example: When a URI uses components of the generic syntax, the component syntax equivalence rules always apply; namely, that the scheme and host are case-insensitive and therefore should be normalized to lowercase. For example, the URI HTTP://www.EXAMPLE.com/ is equivalent to http://www.example.com/. - http://tools.ietf.org/html/rfc3986#section-6.2.2.1 However, that's only for URIs which use the generic syntax (which most URIs we ever touch do use). It would be great if a normalized-IRI with specific normalization rules could be drafted up as part of the next WG charter - after all they are a pretty pivotal part of the sem web setup, and it would be relatively easy to clear up these issues. Best, Nathan -- Regards, Kingsley Idehen President CEO OpenLink Software Web: http://www.openlinksw.com Weblog: http://www.openlinksw.com/blog/~kidehen Twitter/Identi.ca: kidehen
Re: URI Comparisons: RFC 2616 vs. RDF
On 2011-01-17, at 16:37, Dave Reynolds wrote: On Mon, 2011-01-17 at 16:51 +0100, Martin Hepp wrote: Dear all: RFC 2616 [1, section 3.2.3] says that When comparing two URIs to decide if they match or not, a client SHOULD use a case-sensitive octet-by-octet comparison of the entire URIs, with these exceptions: - A port that is empty or not given is equivalent to the default port for that URI-reference; - Comparisons of host names MUST be case-insensitive; - Comparisons of scheme names MUST be case-insensitive; - An empty abs_path is equivalent to an abs_path of /. Characters other than those in the reserved and unsafe sets (see RFC 2396 [42]) are equivalent to their % HEX HEX encoding. For example, the following three URIs are equivalent: http://abc.com:80/~smith/home.html http://ABC.com/%7Esmith/home.html http://ABC.com:/%7esmith/home.html Does this also hold for identifying RDF resources a) in theory and Yes this does hold for RDF systems. You can't guarantee that all RDF systems will do it, so RDF systems should in general exchange canonicalized URIs. There is a ladder of levels at which smarter and smarter systems are aware of more and more equivalences. Good to make your system smart and not end up with widow graphs about http://WWW.w3.org/foo. cwm for example canonicalizes URIs when it loads them into the store. No. RDF Concepts defines equality of RDF URI References [1] as simply character-by-character equality of the %-encoded UTF-8 Unicode strings. Note the final Note in that section: Note: Because of the risk of confusion between RDF URI references that would be equivalent if derefenced, the use of %-escaped characters in RDF URI references is strongly discouraged. which explicitly calls out the difference between URI equivalence (dereference to the same resource) and RDF URI Reference equality. 
BTW the more up to date RFC for looking at equivalence (as opposed to equality) issues is probably the IRI spec [2] which defines a comparison ladder for testing equivalence. Exactly. Dave [1] http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/#section-Graph-URIref [2] http://www.ietf.org/rfc/rfc3987.txt
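As an illustration of canonicalizing on load, here is a rough sketch that covers the RFC 2616 rules quoted above: case-insensitive scheme and host, empty or default port, empty path, and %-escapes of unreserved characters. It is not cwm's code and not a full RFC 3986/3987 implementation. The function name and the default-port table are assumptions, and IPv6 host literals are deliberately not handled.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Unreserved characters per RFC 3986 section 2.3.
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
# Assumed table of default ports; extend as needed.
DEFAULT_PORTS = {"http": "80", "https": "443"}
PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def _fix_escape(match):
    char = chr(int(match.group(1), 16))
    # Decode escapes of unreserved characters; uppercase the hex of the rest.
    return char if char in UNRESERVED else "%" + match.group(1).upper()


def canonicalize(uri: str) -> str:
    """Return one canonical spelling for equivalent http(s)-style URIs."""
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()
    userinfo, at, hostport = parts.netloc.rpartition("@")
    host, _, port = hostport.partition(":")  # assumes no IPv6 literal
    netloc = userinfo + at + host.lower()
    if port and port != DEFAULT_PORTS.get(scheme):
        netloc += ":" + port  # keep only a non-empty, non-default port
    path = PCT.sub(_fix_escape, parts.path) or "/"
    query = PCT.sub(_fix_escape, parts.query)
    return urlunsplit((scheme, netloc, path, query, parts.fragment))


# The three URIs from the RFC 2616 example, loaded into a toy "store".
store = {
    canonicalize(u)
    for u in (
        "http://abc.com:80/~smith/home.html",
        "http://ABC.com/%7Esmith/home.html",
        "http://ABC.com:/%7esmith/home.html",
    )
}
```

After loading, the store holds a single name for the three spellings. A strict RDF Concepts processor would instead keep three distinct nodes, which is the disagreement running through this thread.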
Re: URI Comparisons: RFC 2616 vs. RDF
On 1/17/11 4:54 PM, Nuno Bettencourt wrote: Hi, thank you for the suggestion. This had been a problem before, which in fact becomes easier to solve like that. In my current situation, we were dealing with public/private/protected resources (files), secured by https. So, if a person/agent has a private/protected resource (file) (that only shares with some specific individuals and is only accessible using https protocol) it would be hosted under https://server/abc.html. For this, I would have for example, the following triple: 1) http://server/#me dc:publisher https://server/abc.html Nevertheless, if afterwards I publicly publish that resource (file), for technical reasons that same resource (file) would be given a new URI http://server/abc.html so that it would not require authentication and a new triple would be created (for terms of simplicity I'm omitting other triples that are generated): 2) http://server/#me dc:publisher http://server/abc.html In fact, both those resources (files) are the same, mapped for the same physical file but while the first required SSL credentials, the second does not. In order for those users who before had access to the private resource, to keep accessing the resource, since it is now public (but has been moved from protected), I would add a triple in order for the semantic system to be able to retrieve the same resource, since it is no longer available under its original location. 3) https://server/abc.html owl:sameAs http://server/abc.html But at this point your context has changed, you are now making an assertion in a deductive data space. Basically, a record that is also a proposition re. RDF (or any other) deductive system. Again, the moment you make a triple, you are making a propositional statement. And the moment you do that, in the context of HTTP based Linked Data, it has to be something like this: https://server/abc.html#this owl:sameAs http://server/abc.html#this . 
If you don't care about Linked Data via HTTP user agents following links etc., meaning you're happy with a local graph of propositions that is SPARQL queryable, for instance, then this works too: https://server/abc.html owl:sameAs http://server/abc.html . This unfortunately leads to a minimal and probably unrealistic problem like an open URI https://server/abc.html that might not have any content, since there's no need for it as it has become public and no authentication is needed for accessing it - but it is necessary to keep that triple 1) alive as others might be consuming that information. Triple 3) helps those in finding the resource again. One richer possible solution might be implementing time reasoning mechanisms over this, in order to eliminate those 'fake' URIs, but that would grow the triple store and make reasoning even more time consuming (for now). No need for fake URIs (I guess you might think the #this above == fake), it just comes down to Name References and the need for them to resolve to something useful, which may or may not be useful (e.g. navigable) to an HTTP agent, or deliver factual basis for inference by a deduction oriented engine (logic reasoner). I hope this helps. Kingsley Nuno -Original Message- From: Nathan [mailto:nat...@webr3.org] Sent: segunda-feira, 17 de Janeiro de 2011 18:06 To: Nuno Bettencourt Cc: public-lod@w3.org Subject: Re: URI Comparisons: RFC 2616 vs. RDF Nuno Bettencourt wrote: Hi, The doubt just kept on because in all protocols we were still referring to the same URN. do you mean that there were RDF statements which linked each of the protocol specific URIs to a single URN via the same property? eg: http://... x:foo urn:here https://... x:foo urn:here ftp://... x:foo urn:here If so, then you could define the property (x:foo above) as an Inverse Functional Property which would take care of the sameness for you. 
Best, Nathan -- Regards, Kingsley Idehen President CEO OpenLink Software Web: http://www.openlinksw.com Weblog: http://www.openlinksw.com/blog/~kidehen Twitter/Identi.ca: kidehen
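For the "local graph of propositions" case, here is a small sketch of how a consumer that still holds the old https name could find the public one by following owl:sameAs. The triples are the three from the message above; the function name and the tuple representation are assumptions made for the example, and only owl:sameAs is interpreted, treated as symmetric and transitive.

```python
from collections import defaultdict

OWL_SAME_AS = "owl:sameAs"

# Triples 1) to 3) from the message, as plain (subject, predicate, object) tuples.
triples = [
    ("http://server/#me", "dc:publisher", "https://server/abc.html"),
    ("http://server/#me", "dc:publisher", "http://server/abc.html"),
    ("https://server/abc.html", OWL_SAME_AS, "http://server/abc.html"),
]


def aliases(uri, triples):
    """Return every name reachable from `uri` through owl:sameAs links."""
    links = defaultdict(set)
    for s, p, o in triples:
        if p == OWL_SAME_AS:
            links[s].add(o)  # owl:sameAs is symmetric, so record
            links[o].add(s)  # the link in both directions
    seen, todo = {uri}, [uri]
    while todo:
        for other in links[todo.pop()]:
            if other not in seen:
                seen.add(other)
                todo.append(other)
    return seen


known_names = aliases("https://server/abc.html", triples)
```

Whether the names carry the #this suffix or not makes no difference to this kind of lookup; that choice matters for what an HTTP agent gets back when it dereferences them.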