Re: [whatwg] [Fwd: Re: Helping people seaching for content filtered by license]
Nils Dagsson Moskopp writes:

> On Friday, 2009-05-08, at 19:57 +0000, Ian Hickson wrote:
>
> > * Tara runs a video sharing web site for people who want licensing
> >   information to be included with their videos. When Paul wants to
> >   blog about a video, he can paste a fragment of HTML provided by
> >   Tara directly into his blog. The video is then available inline in
> >   his blog, along with any licensing information about the video.
> >
> > This can be done with HTML5 today. For example, here is the markup
> > you could include to allow someone to embed a video on their site
> > while including the copyright or license information:
> >
> >   <figure>
> >    <video src="http://example.com/videodata/sJf-ulirNRk" controls>
> >     <a href="http://video.example.com/watch?v=sJf-ulirNRk">Watch</a>
> >    </video>
> >    <legend>
> >     Pillar post surgery, starting to heal.
> >     <small>&copy; copyright 2008 Pillar. All Rights Reserved.</small>
> >    </legend>
> >   </figure>
>
> Seriously, I don't get it. Is there really so much entrenched (widely
> deployed, a mess, IE-style) software out there relying on @rel=license
> meaning "license of a single main content blob"

Merely using rel=license in the above example would not cause the copyright message to be displayed to users.

> that an unambiguous (read: machine-readable) writeup of part licenses
> is impossible?

Why does the license information need to be machine-readable in this case? (It may need to be for a different scenario, but that would be dealt with separately.)

> > The example above shows this for a movie, but it works as well for a
> > photo:
> >
> >   <figure>
> >    <img src="http://nearimpossible.com/DSCF0070-1-tm.jpg" alt="">
> >    <legend>
> >     Picture by Bob.
> >     <small><a href="http://creativecommons.org/licenses/by-nc-sa/2.5/legalcode">Creative
> >     Commons Attribution-Noncommercial-Share Alike 2.5 Generic
> >     License</a></small>
> >    </legend>
> >   </figure>
>
> Can I infer from this that an <a> in a <small> inside a <legend> is
> some kind of microformat for licensing information?

No. But if a human sees a string that mentions "© copyright" or "license" then she's likely to realize it's licensing information.
And if it's placed next to a picture, it's conventional to interpret it as applying to the picture. It's also conventional for such information to be small, because it's usually not the main content the user is interested in when choosing to view the page. Magazines and the like have been using this convention for years, without any need to explicitly define what indicates licensing information, and seemingly without any ambiguity or confusion.

Smylers
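For contrast, the machine-readable convention the thread keeps coming back to is rel=license: a consumer only needs to collect the a and link elements whose rel token list contains "license". A minimal sketch of such a consumer, assuming illustrative markup and URLs:

```python
from html.parser import HTMLParser

class LicenseLinkFinder(HTMLParser):
    """Collect href values of <a>/<link> elements carrying rel=license."""
    def __init__(self):
        super().__init__()
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        a = dict(attrs)
        # rel is a space-separated token list, e.g. rel="nofollow license"
        rel = (a.get("rel") or "").split()
        if "license" in rel and a.get("href"):
            self.licenses.append(a["href"])

finder = LicenseLinkFinder()
finder.feed("""
<p><a rel="license"
      href="http://creativecommons.org/licenses/by-nc-sa/2.5/">CC
      BY-NC-SA 2.5</a></p>
""")
print(finder.licenses)
```

Nothing in this sketch displays anything to a human reader, which is exactly the point of Smylers's remark above: rel=license serves machines, while the visible copyright line serves people.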
Re: [whatwg] [Fwd: Re: Helping people seaching for content filtered by license]
On Fri, May 15, 2009 at 8:40 AM, Smylers smyl...@stripey.com wrote:

> Nils Dagsson Moskopp writes:
>
> > On Friday, 2009-05-08, at 19:57 +0000, Ian Hickson wrote:
> >
> > > * Tara runs a video sharing web site for people who want licensing
> > >   information to be included with their videos. When Paul wants to
> > >   blog about a video, he can paste a fragment of HTML provided by
> > >   Tara directly into his blog. The video is then available inline
> > >   in his blog, along with any licensing information about the
> > >   video.
>
> [...]
>
> Why does the license information need to be machine-readable in this
> case? (It may need to be for a different scenario, but that would be
> dealt with separately.)

It would need to be machine-readable for tools like http://search.creativecommons.org/ to do their job: check the license against the engine's built-in knowledge of some licenses, and figure out whether it is suitable for the usages the user has requested (like "search for content I can build upon" or "search for content I can use commercially"). Ideally, it should be enough for a search engine to find the video on either Tara's site *or* Paul's blog for it to be available to users.

Just my two cents.
Re: [whatwg] [Fwd: Re: Helping people seaching for content filtered by license]
Eduard Pascual writes:

> On Fri, May 15, 2009 at 8:40 AM, Smylers smyl...@stripey.com wrote:
>
> > > On Friday, 2009-05-08, at 19:57 +0000, Ian Hickson wrote:
> > >
> > > > * Tara runs a video sharing web site for people who want
> > > >   licensing information to be included with their videos. When
> > > >   Paul wants to blog about a video, he can paste a fragment of
> > > >   HTML provided by Tara directly into his blog. The video is
> > > >   then available inline in his blog, along with any licensing
> > > >   information about the video.
> >
> > Why does the license information need to be machine-readable in this
> > case? (It may need to be for a different scenario, but that would be
> > dealt with separately.)
>
> It would need to be machine-readable for tools like
> http://search.creativecommons.org/ to do their job: check the license
> against the engine's built-in knowledge of some licenses, and figure
> out whether it is suitable for the usages the user has requested (like
> "search for content I can build upon" or "search for content I can use
> commercially"). Ideally, it should be enough for a search engine to
> find the video on either Tara's site *or* Paul's blog for it to be
> available to users.

Yeah, that sounds plausible. However, that's what I meant by "a different scenario" -- adding criteria to the above, specifically about searching. Hixie attempted to address this case too:

> Admittedly, if this scenario is taken in the context of the first
> scenario, meaning that Bob wants this image to be discoverable through
> search, but doesn't want to include it on a page of its own, then
> extra syntax to mark this particular image up would be useful.
> However, in my research I found very few such cases. In every case
> where I found multiple media items on a single page with no dedicated
> page, either every item was licensed identically and was the main
> content of the page, or each item had its own separate page, or the
> items were licensed under the same license as the page. In all three
> of these cases, rel=license already solves the problem today.

To which Nils responded:

> Relying on linked pages just to get licensing information would be,
> well, massive overhead. Still, you are right - most blogs using many
> pictures have dedicated pages.

It's perfectly valid to disagree with this being sufficient (I personally have no view either way on the matter). I was just clarifying that the legend mark-up example wasn't attempting to address this case, and wasn't proposing <legend><small> (or whatever) as a machine-readable microformat.

Smylers
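Eduard's point about search.creativecommons.org can be made concrete. A license-aware search engine needs two things: a machine-readable license URL attached to the content, and a built-in table mapping known license URLs to the rights they grant. A rough sketch, where the table entries are illustrative and not an official Creative Commons data set:

```python
# Hypothetical built-in knowledge base: license URL -> usage rights a
# search engine could filter on. Entries are illustrative only.
KNOWN_LICENSES = {
    "http://creativecommons.org/licenses/by/2.5/":
        {"commercial_use", "derivative_works"},
    "http://creativecommons.org/licenses/by-nc-sa/2.5/":
        {"derivative_works"},
}

def allows(license_url, requested_uses):
    """True if every requested use is granted by a known license."""
    granted = KNOWN_LICENSES.get(license_url)
    if granted is None:
        return False  # unknown license: assume nothing is permitted
    return requested_uses <= granted

# "search for content I can build upon"
print(allows("http://creativecommons.org/licenses/by-nc-sa/2.5/",
             {"derivative_works"}))
# "search for content I can use commercially"
print(allows("http://creativecommons.org/licenses/by-nc-sa/2.5/",
             {"commercial_use"}))
```

The visible legend/small convention gives a human everything they need, but a filter like this can only key off an unambiguous license URL, which is the gap Eduard is describing.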
Re: [whatwg] Annotating structured data that HTML has no semantics for
On Thu, May 14, 2009 at 10:17 PM, Maciej Stachowiak m...@apple.com wrote:

> [...] From my cursory study, I think microdata could subsume many of
> the use cases of both microformats and RDFa.

Maybe. But microformats and RDFa can handle *all* of these cases. Again, what are the benefits of creating something entirely new to replace what already exists, when it can't even handle all the cases of what it is replacing? Both the new syntax and the case restrictions are costs: what are these costs buying? If it's not clear what we are getting for these costs, it is impossible to evaluate whether the costs are worth it or not.

> It seems to me that it avoids much of what microformats advocates find
> objectionable

Could you specify, please? Do you mean anything other than the WHATWG's almost irrational hate toward CURIEs and everything that involves prefixes?

> but at the same time it seems it can represent a full RDF data model.

No, it *can't* represent a full RDF model: this has already been shown several times on this thread.

> Thus, I think we have the potential to get one solution that works for
> everyone.

RDFa itself doesn't work for everyone; but microdata is even more restricted: it leaves out the cases that RDFa leaves out, but it also leaves out some cases that RDFa was able to handle. So, where do you see such potential?

> I'm not 100% sure microdata can really achieve this, but I think
> making the attempt is a positive step.

What do you mean by "making the attempt"? If there is something microdata can't handle, it won't be able to handle it without changing the spec. If you meant that evolving the microdata proposal towards something that works for everyone is a positive step, then I agree; but if you meant engraving this microdata approach into the spec, setting it in stone, and then attempting to get everyone to accept it, then I definitely disagree. So, please, could you clarify the meaning of that statement? Thanks.

> One other detail that it seems not many people have picked up on yet
> is that microdata proposes a DOM API to extract microdata-based info
> from a live document on the client side. In my opinion this is huge
> and has the potential to greatly increase author interest in semantic
> markup.

All right, an API may be a benefit. Most probably it is. However, a similar API could have been built on RDFa, or eRDF, or EASE, or any other already existing or new solution; so it doesn't justify creating a new syntax. I have to insist: what are the benefits of such a built-from-the-ground, restrictive *syntax*? That's what we need to know to evaluate it against its costs.

> Now, it may be that microdata will ultimately fail, either because it
> is outcompeted by RDFa, or because not enough people care about
> semantic markup, or whatever. But at least for now, I don't see a
> reason to strangle it in the cradle.

At least for now, I don't see a reason why it was created to begin with. Maybe if somebody could enlighten us on this detail, this discussion could evolve into something more useful and productive.

On Fri, May 15, 2009 at 6:53 AM, Maciej Stachowiak m...@apple.com wrote:

> On May 14, 2009, at 1:30 PM, Shelley Powers wrote:
>
> > So, if I'm pushing for RDFa, it's not because I want to win. It's
> > because I have things I want to do now, and I would like to make
> > sure they have a reasonable chance of working a couple of years in
> > the future. And yeah, once SVG is in HTML5, and RDFa can work with
> > HTML5, maybe I wouldn't mind giving old HTML a try again. Lord knows
> > I'd like to use ampersands again.
>
> It sounds like your argument comes down to this: you have personally
> invested in RDFa, therefore having a competing technology is bad,
> regardless of the technical merits.

Pause, please. Before going on, I need to ask again: which are those technical merits??

> I don't mean to parody here - I am somewhat sympathetic to this line
> of argument.

I think I'm interpreting Shelley's argument slightly differently. She didn't choose RDFa because it was better than microdata. She chose RDFa because it was better than other options, and microdata didn't even exist yet. Now microdata comes out, some drawbacks are highlighted in comparison with RDFa (lack of typing, inability to depict the full RDF model, reversed domains that are as ugly as CURIEs but, unlike CURIEs, often don't resolve to anything useful at all), and you ask RDFa proponents to give microdata a chance, to not strangle it in the cradle; but nobody seems willing to answer the one question: what does microdata provide to make up for its drawbacks?

> Often pragmatic concerns mean that an incremental improvement just
> isn't worth the cost of switching

Wait. Are you referring to microdata as an "incremental improvement" over RDFa?? IMO, it's rather a decremental enworsement.

> My personal judgment is that we're not past the point of no return on
> data embedding. There's microformats, RDFa, and then dozens of other
> serializations of
Re: [whatwg] Annotating structured data that HTML has no semantics for
I do not think anybody in WHATWG hates the CURIE tool; however, the following problems have been put forward:

Copy-Paste: The CURIE mechanism is considered inconvenient because it is not copy-paste-resilient, and the associated risk is that semantic elements would randomly change their meaning.

Link rot: CURIE definitions can only be looked up while the CURIE server is providing them; the chance of the URL becoming broken is high for home-brewed vocabularies. While the vocabularies can be moved elsewhere, it will not always be possible to create a redirect.

Chris
Re: [whatwg] Annotating structured data that HTML has no semantics for
Maciej Stachowiak wrote:

> > > On May 14, 2009, at 1:30 PM, Shelley Powers wrote:
> > >
> > > > So, if I'm pushing for RDFa, it's not because I want to win.
> > > > It's because I have things I want to do now, and I would like to
> > > > make sure they have a reasonable chance of working a couple of
> > > > years in the future. And yeah, once SVG is in HTML5, and RDFa
> > > > can work with HTML5, maybe I wouldn't mind giving old HTML a try
> > > > again. Lord knows I'd like to use ampersands again.
> > >
> > > It sounds like your argument comes down to this: you have
> > > personally invested in RDFa, therefore having a competing
> > > technology is bad, regardless of the technical merits. I don't
> > > mean to parody here - I am somewhat sympathetic to this line of
> > > argument. Often pragmatic concerns mean that an incremental
> > > improvement just isn't worth the cost of switching (for example
> > > HTML vs. XHTML). My personal judgment is that we're not past the
> > > point of no return on data embedding. There's microformats, RDFa,
> > > and then dozens of other serializations of RDF (some of which you
> > > cited). This doesn't seem like a space on the verge of picking a
> > > single winner, and the players seem willing to experiment with
> > > different options.
> >
> > There are not dozens of other serializations of RDF. The point I was
> > trying to make is, I'd rather put my time into something that exists
> > now, than have to watch the wheel re-invented. I'd rather see
> > semantic metadata become a reality. I'm glad that you personally
> > feel that companies will be just peachy keen on having to support
> > multiple parsers to get the same data. On the HTML WG side, I will
> > never support microdata, because no case has been made for its
> > existence. The point is, people in the real world have to use this
> > stuff. It helps them if they have one, generally agreed on,
> > approach. As it is, folks have to contend with both RDFa and
> > microformats, but at least we know these have different purposes.
> >
> > > From my cursory study, I think microdata could subsume many of the
> > > use cases of both microformats and RDFa. It seems to me that it
> > > avoids much of what microformats advocates find objectionable, and
> > > provides a good basis for new microformats; but at the same time
> > > it seems it can represent a full RDF data model. Thus, I think we
> > > have the potential to get one solution that works for everyone.
> > > I'm not 100% sure microdata can really achieve this, but I think
> > > making the attempt is a positive step.
> >
> > It can't, don't you see? Microdata will only work in HTML5/XHTML5.
> > XHTML 1.1 and yes, 2.0 will be around for years, decades. In
> > addition, XHTML5 already supports RDFa.
>
> Supporting XHTML 1.1 has about 0.001% as much value as supporting
> text/html. XHTML 2.0 is completely irrelevant to the Web, and looks on
> track to remain so. So I don't find this point very persuasive.

I don't think you'll find that the world is breathlessly waiting for HTML5. I think you'll find that XHTML 1.1 will have wider use than HTML5 for the next decade. If not longer. I wouldn't count out XHTML 2.0, either. And in a decade, a lot can change. Why you think something completely brand new, with no vendor support, drummed up in a few hours or a day, is more robust and a better option than a mature spec in wide use, well, frankly boggles my mind.

> I haven't evaluated it enough to know for sure (as I said). I do think
> avoiding CURIEs is extremely valuable from the point of view of sane
> text/html semantics and ease of authoring; and RDF experts seem to
> think it works fine for representing RDF data models. So tentatively,
> I don't see any gaping holes. If you see a technical problem, and not
> just potential competition for the technology you've invested in, then
> you should definitely cite it.

I don't think CURIEs are that difficult, nor impossible, no matter the arguments that Henri brings out. I am impressed with your belief in HTML5. But:

> > > One other detail that it seems not many people have picked up on
> > > yet is that microdata proposes a DOM API to extract microdata-based
> > > info from a live document on the client side. In my opinion this
> > > is huge and has the potential to greatly increase author interest
> > > in semantic markup.
> >
> > Not really. Can do this now with RDFa in XHTML. And I don't need any
> > new DOM to do it. The power of semantic markup isn't really seen
> > until you take that markup data _outside_ the document. And merge
> > that data with data from other documents. Google rich snippets.
> > Yahoo searchmonkey. Heck, even an application that manages the data
> > from different subsites of one domain.
>
> I respectfully disagree. An API to do things client-side that doesn't
> require an external library is extremely powerful, because it lets
> content authors easily make use of the very same semantic markup that
> they are vending for third parties, so they have more incentive to use
> it and get it right.

Sure, we'll have to disagree on this one.

> Now, it may be that microdata will ultimately fail, either because it
> is outcompeted by RDFa,
Re: [whatwg] Annotating structured data that HTML has no semantics for
On 15/5/09 14:11, Shelley Powers wrote:

> Kristof Zelechovski wrote:
>
> > I do not think anybody in WHATWG hates the CURIE tool; however, the
> > following problems have been put forward: Copy-Paste: The CURIE
> > mechanism is considered inconvenient because it is not
> > copy-paste-resilient, and the associated risk is that semantic
> > elements would randomly change their meaning.
>
> Well, no, the elements won't randomly change their meaning. The only
> risk is copying and pasting them into a document that doesn't provide
> namespace definitions for the prefixes. Are you thinking that someone
> will be using different namespaces but the same prefix? Come on -- do
> you really think that will happen?

The most likely case is with Dublin Core, but DC data varies enough already that this isn't too destructive...

Dan
Re: [whatwg] Annotating structured data that HTML has no semantics for
On Fri, May 15, 2009 at 1:44 PM, Kristof Zelechovski giecr...@stegny.2a.pl wrote:

> I do not think anybody in WHATWG hates the CURIE tool; however, the
> following problems have been put forward: Copy-Paste: The CURIE
> mechanism is considered inconvenient because it is not
> copy-paste-resilient, and the associated risk is that semantic
> elements would randomly change their meaning.

Copy-paste issues with RDFa and similar syntaxes can take two forms.

The first is orphaned prefixes: metadata with a given prefix is copied, but then pasted in a context where the prefix is not defined. If the user who is copy-pasting this stuff really cares about metadata, s/he will review the code and make the relevant fixes and/or copy the prefix declarations, the same way an author who is copy-pasting content and wants to preserve formatting will copy the CSS. If the user doesn't actually care about the metadata, then there is no harm, because properties relying on an unmapped prefix should yield no RDF output at all.

The second form is prefix clashes: this is actually extremely rare. For example, someone copies code with FOAF metadata, and then pastes it on another page: what are the chances that the user will be using a foaf: prefix for something other than FOAF? Sure, there are cases where a clash might happen but, again, these are only likely to appear on pages by authors who have some idea about metadata, and hence the author is more than likely to review the code being pasted to prevent these and other clashes (such as classes that would mean something completely different under the new page's CSS, element id clashes, etc.). A last possibility is that the author doesn't have any idea about metadata at all, but is using a CMS that relies on metadata. In such a case, it would be wise on the CMS's part to pre-process code fragments and either map the prefixes to what they mean (if it's obvious) or remove the invalid data (if the CMS can't figure out what it should mean).

> Link rot: CURIE definitions can only be looked up while the CURIE
> server is providing them; the chance of the URL becoming broken is
> high for home-brewed vocabularies. While the vocabularies can be moved
> elsewhere, it will not always be possible to create a redirect.

Oh, and do reversed domains help at all with this? OK, with CURIEs there is a (relatively small) chance of the CURIE not being resolvable at a given time; reversed domains have a 100% chance of not being resolvable at any time: there is always, at least, ambiguity: does org.example.foo map to foo.example.org, example.org/foo, or example.org#foo? Even better: what if, under example.org, we find one vocabulary at example.org/foo and another at foo.example.org? (OK, that'd be quite unwise, although it might be a legitimate way to keep deployed and test versions of a vocabulary online at the same time; but anyway, CURIEs can cope with it, while reversed domains can't.)

Wherever there are links, there is a chance of broken links: that's part of the nature of links, and of the evolving nature of the web. But just because of the chance of links being broken, would you deny the utility of elements such as a and link? Reversed domains don't face broken links because they are simply incapable of linking to anything.

Now, please, I'd appreciate it if you reviewed your arguments before posting them: while the copy-paste issue is a legitimate argument, and now we can consider whether this copy-paste-resilience is worth the costs of microdata, the link rot argument is just a waste of everyone's time, including yours. Anyway, thanks for that first argument: that's exactly what I was asking for, in the hope of letting this discussion advance somewhere.

So, before we start comparing benefits against costs, can someone post any more benefits, or does the copy-paste-resilience point stand alone against all the costs and possible issues?

Regards, Eduard Pascual
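Both mechanisms Eduard contrasts are easy to sketch. CURIE expansion is a deterministic lookup against a prefix map, and an orphaned prefix simply produces no output; a reversed domain has no mapping step at all, which is where the ambiguity he describes comes from. A minimal illustration (the prefix map and candidate URLs are examples, not part of any spec):

```python
def expand_curie(curie, prefixes):
    """Expand a CURIE like 'foaf:name' against a prefix map. An
    unmapped (orphaned) prefix yields None, i.e. no RDF output."""
    prefix, _, reference = curie.partition(":")
    base = prefixes.get(prefix)
    return base + reference if base else None

prefixes = {"foaf": "http://xmlns.com/foaf/0.1/"}
print(expand_curie("foaf:name", prefixes))  # resolves to a real URL
print(expand_curie("dc:title", prefixes))   # orphaned prefix: dropped

# A reversed domain carries no prefix map, so 'org.example.foo' could
# correspond to any of these (illustrative guesses), with no way to
# tell which, or to dereference any of them:
candidates = ["http://foo.example.org/", "http://example.org/foo",
              "http://example.org#foo"]
```

The copy-paste trade-off is visible here too: the expansion function needs the prefix map to travel with the fragment, which is exactly what gets lost in careless copy-paste.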
Re: [whatwg] Annotating structured data that HTML has no semantics for
On Thu, 14 May 2009 22:30:41 +0200, Shelley Powers shell...@burningbird.net wrote:

> > I'm not 100% sure microdata can really achieve this, but I think
> > making the attempt is a positive step.
>
> It can't, don't you see? Microdata will only work in HTML5/XHTML5.

Actually, as specified, it would work for any text/html and any XHTML content. It would just be valid only in (X)HTML5, but it would work even if the input is not valid (X)HTML5, or looks like HTML4 or XHTML 1.1.

> XHTML 1.1 and yes, 2.0 will be around for years, decades. In addition,
> XHTML5 already supports RDFa.

XHTML5 supports RDFa to the same extent that XHTML 1.1 supports microdata (in both cases, it would work but is not valid).

-- 
Simon Pieters
Opera Software
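Simon's point is that microdata processing only depends on attributes being present in the parsed tree, not on the document being valid (X)HTML5. A very rough sketch of that idea, collecting itemprop values from arbitrary markup; this deliberately ignores nesting, itemref, and the per-element value rules of the real microdata algorithm, so it is an illustration rather than a conforming extractor:

```python
from html.parser import HTMLParser

class FlatMicrodata(HTMLParser):
    """Collect itemprop name/value pairs from a markup fragment.
    A flat sketch only: real microdata has nested items, itemref,
    and per-element value rules that this ignores."""
    def __init__(self):
        super().__init__()
        self.props = {}
        self._pending = None  # itemprop whose value is the text content

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        name = a.get("itemprop")
        if not name:
            return
        if tag == "a" and "href" in a:   # URL-valued property
            self.props[name] = a["href"]
        else:
            self._pending = name

    def handle_data(self, data):
        if self._pending and data.strip():
            self.props[self._pending] = data.strip()
            self._pending = None

md = FlatMicrodata()
md.feed('<p itemscope><span itemprop="name">Pillar</span>'
        '<a itemprop="license" href="http://example.com/l">terms</a></p>')
print(md.props)
```

Nothing here checks a doctype or a validity constraint, which is the substance of the "works in any text/html, valid only in (X)HTML5" distinction.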
Re: [whatwg] Annotating structured data that HTML has no semantics for
Links contribute to the behavior of the text contained within them, not to its meaning, which does not depend on whether the link is broken or not. Moreover, whether the linked resource can be retrieved at all depends on the URI scheme, as in href="mailto:u...@domain". The advertised advantage of CURIE prefixes is that the metadata declaration can be retrieved and looked up, and that can influence the meaning of the text thus marked. Therefore, link rot is a bigger problem for CURIE prefixes than for links. I think the original URL corresponding to a reversed domain prefix is irrelevant, and attempts to reconstruct it are futile anyway. Nonexistent features are better than features that decay progressively, at least as far as a specification is concerned.

Best regards, Chris
[whatwg] Link rot is not dangerous (was: Re: Annotating structured data that HTML has no semantics for)
Kristof Zelechovski wrote:

> Therefore, link rot is a bigger problem for CURIE prefixes than for
> links.

There have been a number of people now that have gone to great lengths to outline how awful link rot is for CURIEs and the semantic web in general. This is a flawed conclusion, based on the assumption that there must be a single vocabulary document in existence, for all time, at one location. It has also led to a false requirement that all vocabularies should be centralized.

Here's the fear: if a vocabulary document disappears for any reason, then the meaning of the vocabulary is lost and all triples depending on the lost vocabulary become useless.

That fear ignores the fact that we have a highly available document store available to us (the Web). Not only that, but these vocabularies will be cached (at Google, at Yahoo, at The Wayback Machine, etc.). If a vocabulary document disappears, which is highly unlikely for popular vocabularies - imagine FOAF disappearing overnight - then there are alternative mechanisms to extract meaning from the triples that will be left on the web. Here are just two of the possible solutions to the problem outlined:

- The vocabulary is restored at another URL using a cached copy of the vocabulary. The site owner of the original vocabulary either re-uses the vocabulary, or redirects the vocabulary page to another domain (somebody that will ensure the vocabulary continues to be provided - somebody like the W3C).

- RDFa parsers can be given an override list of legacy vocabularies that will be loaded from disk (from a cached copy). If a cached copy of the vocabulary cannot be found, it can be re-created from scratch if necessary.

The argument that link rot would cause massive damage to the semantic web is just not true. Even if there is minor damage caused, it is fairly easy to recover from it, as outlined above.

-- manu

-- 
Manu Sporny
President/CEO - Digital Bazaar, Inc.
blog: A Collaborative Distribution Model for Music http://blog.digitalbazaar.com/2009/04/04/collaborative-music-model/
Re: [whatwg] Link rot is not dangerous
On 15/5/09 18:20, Manu Sporny wrote:

> Kristof Zelechovski wrote:
>
> > Therefore, link rot is a bigger problem for CURIE prefixes than for
> > links.
>
> There have been a number of people now that have gone to great lengths
> to outline how awful link rot is for CURIEs and the semantic web in
> general. This is a flawed conclusion, based on the assumption that
> there must be a single vocabulary document in existence, for all time,
> at one location.
>
> [...]
>
> The argument that link rot would cause massive damage to the semantic
> web is just not true. Even if there is minor damage caused, it is
> fairly easy to recover from it, as outlined above.

A few other points:

1. It's for the community of vocabulary-creators to help each other out w.r.t. hosting/publishing these: I just nudged a friend to put another 5 years on the DNS rental for a popular namespace. I think we should put a bit more structure around these kinds of habits, so that popular namespaces won't drop off the Web through accident.

2. Digitally signing the schemas will become part of the story, I'm sure. While it's a bit fiddly, there are advantages to having mechanisms beyond URI de-referencing for knowing where a schema came from.

3. Parties worried about external dependencies when using namespaces can always indirect through their own namespace, whose schema document can declare subclass/subproperty relations to other URIs.

cheers

Dan
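The override-list idea raised earlier in this thread (a parser consulting local cached or relocated copies before dereferencing a vocabulary URL) amounts to a small indirection table. A sketch, where the file paths and URLs are made up for illustration:

```python
# Sketch of a vocabulary override list: before fetching a vocabulary
# URL, a parser checks a locally managed table of cached or relocated
# copies. Entries here are invented for illustration.
OVERRIDES = {
    "http://xmlns.com/foaf/0.1/": "file:///var/cache/vocab/foaf.rdf",
}

def vocabulary_source(url, overrides=OVERRIDES):
    """Return where to actually load a vocabulary from: a local
    override if one is registered, otherwise the original URL."""
    return overrides.get(url, url)

print(vocabulary_source("http://xmlns.com/foaf/0.1/"))
print(vocabulary_source("http://example.org/my-vocab#"))
```

Because each deployment maintains its own table, nothing about this requires a central repository; that is the point Manu makes later in the thread when the "central repository" objection comes up.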
Re: [whatwg] Link rot is not dangerous (was: Re: Annotating structured data that HTML has no semantics for)
I understand that there are ways to recover resources that disappear from the Web; however, the postulated advantage of RDFa, "you can go see what it means", simply does not hold. The recovery mechanism, Web search/cache, would be as good for CURIE URLs as for domain prefixes. Creating a redirect is not always possible, and the built-in redirect dictionary (a CURIE catalog?) smells of a central repository. This is no better than public entity identifiers in XML. Serving the vocabulary from one's own domain is not always possible, e.g. in the case of reader-contributed content, and it only guarantees that the vocabulary will be alive while it is supported by the domain owner. (The WHATWG wants HTML documents to be readable 1000 years from now.) It is not always practical either, as it could confuse URL-based tools that do not retrieve the resources referenced. All this does not imply, of course, that RDFa is no good. It is only intended to demonstrate that the postulated advantage of the CURIE lookup is wishful thinking.

Best regards, Chris
Re: [whatwg] Link rot is not dangerous
Dan Brickley wrote: On 15/5/09 18:20, Manu Sporny wrote: Kristof Zelechovski wrote: Therefore, link rot is a bigger problem for CURIE prefixes than for links. There have been a number of people now that have gone to great lengths to outline how awful link rot is for CURIEs and the semantic web in general. This is a flawed conclusion, based on the assumption that there must be a single vocabulary document in existence, for all time, at one location. This has also lead to a false requirement that all vocabularies should be centralized. Here's the fear: If a vocabulary document disappears for any reason, then the meaning of the vocabulary is lost and all triples depending on the lost vocabulary become useless. That fear ignores the fact that we have a highly available document store available to us (the Web). Not only that, but these vocabularies will be cached (at Google, at Yahoo, at The Wayback Machine, etc.). IF a vocabulary document disappears, which is highly unlikely for popular vocabularies - imagine FOAF disappearing overnight, then there are alternative mechanisms to extract meaning from the triples that will be left on the web. Here are just two of the possible solutions to the problem outlined: - The vocabulary is restored at another URL using a cached copy of the vocabulary. The site owner of the original vocabulary either re-uses the vocabulary, or re-directs the vocabulary page to another domain (somebody that will ensure the vocabulary continues to be provided - somebody like the W3C). - RDFa parsers can be given an override list of legacy vocabularies that will be loaded from disk (from a cached copy). If a cached copy of the vocabulary cannot be found, it can be re-created from scratch if necessary. The argument that link rot would cause massive damage to the semantic web is just not true. Even if there is minor damage caused, it is fairly easy to recover from it, as outlined above. A few other points: 1. 
It's for the community of vocabulary-creators to help each other out w.r.t. hosting/publishing these: I just nudged a friend to put another 5 years on the DNS rental for a popular namespace. I think we should put a bit more structure around these kinds of habit, so that popular namespaces won't drop off the Web through accident. 2. digitally signing the schemas will become part of the story, I'm sure. While it's a bit fiddly, there are advantages to having other mechanisms beyond URI de-referencing for knowing where a schema came from 3. Parties worried about external dependencies when using namespaces can always indirect through their own namespace, whose schema document can declare subclass/subproperty relations to other URIs cheers Dan The most important point to take from all of this, though, is that link rot within the RDF world is an extremely rare and unlikely occurrence. I've been working with RDF for close to a decade, and link rot has never been an issue. One of the very first uses of RDF, in RSS 1.0, for feeds, is still in existence, still viable. You don't have to take my word, check it out yourselves: http://purl.org/rss/1.0/ Even if, and I want to strongly emphasize if link rot does occur, both Manu and Dan have demonstrated multiple ways of ensuring that no meaning is lost, and nothing is broken. However, I hope that people are open enough to take away from their discussions that they are trying to treat this concern respectfully, and trying to demonstrate that there's more than one solution. Not that this forms a proof that Oh my god, if we use RDF, we're doomed! Also don't lose sight that this is really no more serious an issue than, say, a company originating com.sun.* being purchased by another company, named com.oracle.*. And you can't say, Well that's not the same, because it is. The only safe bet is to designate some central authority and give them power over every possible name. 
Then we run the massive risk of this system failing (and this applies to microdata's reverse DNS as well as RDF's URIs), or of it being taken over by an entity that sees such a data store as a way to make a great profit. We also defeat the very principle on which semantic data on the web abides, and that's true whether you support microdata or RDF. Shelley
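One of the recovery mechanisms Manu describes above - pointing an RDFa parser at locally cached copies of vocabularies whose hosts have vanished - might be sketched like this. This is a hypothetical illustration; the override-file shape and the `resolve_vocabulary` hook are invented, not any real parser's API:

```python
# Hypothetical override table for an RDFa parser: vocabulary URIs whose
# original host has disappeared are redirected to locally cached copies.
VOCAB_OVERRIDES = {
    "http://defunct.example.org/vocab#": "file:///var/cache/vocabs/defunct-vocab.rdf",
}

def resolve_vocabulary(uri):
    """Return a dereferenceable location for a vocabulary URI.

    If the URI is in the override list, use the cached copy on disk;
    otherwise fall back to the URI itself.
    """
    return VOCAB_OVERRIDES.get(uri, uri)
```

Nothing here needs a central repository: each site (or each parser installation) can maintain its own override list.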
Re: [whatwg] Link rot is not dangerous
Classes in com.sun.* are reserved for Java implementation details and should not be used by the general public. CURIE URLs are intended for general use. So, I can say, "Well, it is not the same," because it is not. Cheers, Chris
Re: [whatwg] Link rot is not dangerous
Kristof Zelechovski wrote: I understand that there are ways to recover resources that disappear from the Web; however, the postulated advantage of RDFa - "you can go see what it means" - simply does not hold. This is a strawman argument - more below... All this does not imply, of course, that RDFa is no good. It is only intended to demonstrate that the postulated advantage of the CURIE lookup is wishful thinking. That train of logic seems to falsely conclude that if something does not hold true 100% of the time, then it cannot be counted as an advantage. Example: Since the postulated advantage of RAID-5 is that a disk array is unlikely to fail due to a single disk failure, and since it is possible for more than one disk to fail before a recovery is complete, one cannot call running a disk array in RAID-5 mode an advantage over not running RAID at all (because failure is possible). Or: Since the postulated advantage of CURIEs is that "you can go see what it means", and it is possible for a CURIE-defined URL to be unavailable, one cannot call it an advantage because it may fail. There are two flaws in the premises and reasoning above, for the CURIE case: - It is assumed that for something to be called an 'advantage' it must hold true 100% of the time. - It is assumed that most proponents of RDFa believe that "you can go see what it means" holds at all times - one would have to be very deluded to believe that. The recovery mechanism, Web search/cache, would be as good for CURIE URLs as for domain prefixes. Creating a redirect is not always possible, and the built-in redirect dictionary (CURIE catalog?) smells of a central repository. Why does having a file sitting on your local machine that lists alternate vocabulary files for CURIEs smell of a central repository? Perhaps you're assuming that the file would be managed by a single entity? If so, it wouldn't need to be, and that was not what I was proposing. Serving the vocabulary from one's own domain is not always possible, e.g. 
in case of reader-contributed content, This isn't clear - could you please clarify what you mean by reader-contributed content? and only guarantees that the vocabulary will be alive while it is supported by the domain owner. This case and its solution were already covered previously. Again - if the domain owner disappears, the domain disappears, or the domain owner doesn't want to cooperate for any reason, one could easily set up an alternate URL and instruct the RDFa processor to redirect any discovered CURIEs that match the old vocabulary to the new (referenceable) vocabulary. (WHATWG wants HTML documents to be readable 1000 years from now.) Is that really a requirement? What about external CSS files that disappear? External Javascript files that disappear? External SVG files that disappear? All those have something to do with the document's human/machine readability. Why is HTML5 not susceptible to link rot in the same way that RDFa is susceptible to link rot? Also, why 1000 years? That seems a bit arbitrary. =P It is not always practical either, as it could confuse URL-based tools that do not retrieve the resources referenced. Could you give an example of this that wouldn't be a bug in the dereferencing application? How could a non-dereferenceable URL confuse URL-based tools? -- manu -- Manu Sporny President/CEO - Digital Bazaar, Inc. blog: A Collaborative Distribution Model for Music http://blog.digitalbazaar.com/2009/04/04/collaborative-music-model/
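For what it's worth, the RAID-5 analogy in this exchange can be made concrete with back-of-the-envelope arithmetic. The per-disk failure probability below is invented purely for illustration; the point is only that "can fail" is not the same as "no advantage":

```python
# Toy comparison: probability of data loss with a single disk vs. a
# three-disk RAID-5 array, over some fixed period. The per-disk failure
# probability p is invented for illustration.
from math import comb

p = 0.05  # assumed chance a given disk fails in the period

# Single disk: any failure loses the data.
single_disk_loss = p

# RAID-5 with 3 disks survives any one failure; data is lost only if
# two or more disks fail (ignoring rebuild windows for simplicity).
raid5_loss = sum(comb(3, k) * p**k * (1 - p)**(3 - k) for k in (2, 3))

print(single_disk_loss)            # 0.05
print(round(raid5_loss, 6))        # 0.00725
```

The array still has a nonzero chance of losing data, yet it is clearly an advantage over a single disk - which is exactly the shape of Manu's objection to the "CURIEs can fail, therefore no advantage" argument.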
Re: [whatwg] Annotating structured data that HTML has no semantics for
On Fri, May 15, 2009 at 9:17 AM, Eduard Pascual herenva...@gmail.com wrote: On Fri, May 15, 2009 at 1:44 PM, Kristof Zelechovski Link rot CURIE definitions can only be looked up while the CURIE server is providing them; the chance of the URL becoming broken is high for home-brewed vocabularies. While the vocabularies can be moved elsewhere, it will not always be possible to create a redirect. Oh, and do reversed domains help at all with this? Ok, with CURIEs there is a (relatively small) chance for the CURIE to not be resolvable at a given time; reversed domains have a 100% chance to not be resolvable at any time: there is always, at least, ambiguity: does org.example.foo map to foo.example.org, example.org/foo, or example.org#foo? Even better: what if, under example.org, we find a vocabulary at example.org/foo and another at foo.example.org? (Ok, that'd be quite unwise, although it might be a legitimate way to keep deployed and test versions of a vocabulary online at the same time; but anyway CURIEs can cope with it, while reversed domains can't.) Wherever there are links, there is a chance for broken links: that's part of the nature of links, and the evolving nature of the web. But just because links can be broken, would you deny the utility of elements such as a and link? Reversed domains don't face broken links because they are simply incapable of linking to anything. Reversed domains aren't *meant* to link to anything. They shouldn't be parsed at all. They're a uniquifier so that multiple vocabularies can use the same terms without clashing or ambiguity. The Microdata proposal also allows normal urls, but they are similarly nothing more than a uniquifier. CURIEs, at least theoretically, *rely* on the prefix lookup. After all, how else can you tell that a given relation is really the same as, say, foaf:name? If the domain isn't available, the data will be parsed incorrectly. That's why link rot is an issue. ~TJ
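Eduard's ambiguity point can be shown mechanically: a reversed-domain identifier simply does not determine a unique URL. A toy sketch (the three candidate mappings are the ones he lists; nothing in the identifier itself picks one):

```python
def candidate_urls(reversed_name):
    """Enumerate the URLs a reversed-domain identifier *might* denote.

    org.example.foo could plausibly mean any of these; the identifier
    alone cannot disambiguate.
    """
    parts = reversed_name.split(".")
    domain = ".".join(reversed(parts[:-1]))  # org.example -> example.org
    term = parts[-1]
    return [
        f"http://{term}.{domain}/",   # foo.example.org
        f"http://{domain}/{term}",    # example.org/foo
        f"http://{domain}#{term}",    # example.org#foo
    ]
```

Of course, as Tab replies, the microdata position is that none of these mappings is ever supposed to be performed - the string is only a uniquifier.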
[whatwg] Link rot is not dangerous
Tab Atkins Jr. wrote: Reversed domains aren't *meant* to link to anything. They shouldn't be parsed at all. They're a uniquifier so that multiple vocabularies can use the same terms without clashing or ambiguity. The Microdata proposal also allows normal urls, but they are similarly nothing more than a uniquifier. CURIEs, at least theoretically, *rely* on the prefix lookup. After all, how else can you tell that a given relation is really the same as, say, foaf:name? If the domain isn't available, the data will be parsed incorrectly. That's why link rot is an issue. Where in the CURIE spec does it state or imply that if a domain isn't available, the resulting parsed data will be invalid? -- manu -- Manu Sporny President/CEO - Digital Bazaar, Inc. blog: A Collaborative Distribution Model for Music http://blog.digitalbazaar.com/2009/04/04/collaborative-music-model/
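Manu's question has a concrete basis: CURIE expansion is pure string substitution against a prefix mapping declared in the document itself, so producing the full URI never requires dereferencing anything. A minimal sketch (real RDFa parsers read the bindings from the document's xmlns:/prefix attributes; the mapping here is written out by hand):

```python
# Minimal CURIE expansion: the prefix mapping travels with the document,
# so expanding foaf:name to a full URI needs no network access at all.
prefixes = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "dc": "http://purl.org/dc/terms/",
}

def expand_curie(curie, prefixes):
    prefix, _, reference = curie.partition(":")
    if prefix in prefixes:
        return prefixes[prefix] + reference
    return curie  # not a declared prefix; leave untouched

print(expand_curie("foaf:name", prefixes))
# -> http://xmlns.com/foaf/0.1/name
```

Whether the resulting URI *resolves* is a separate question from whether the triple was parsed correctly - which is the distinction the two sides of this exchange are arguing over.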
Re: [whatwg] Link rot is not dangerous
Serving the RDFa vocabulary from one's own domain is not always possible, e.g. when a reader of a Web site is encouraged to post a comment to the page she reads and her comment contains semantic annotations. The probability of a URL becoming unavailable is much greater than that of both mirrored drives wearing out at the same time. (Data mirroring does not claim to protect from fire, water, high voltage, magnetic storms, earthquakes and the like; it only protects you from natural wear.) The probability of ultimately losing data stored in one copy is 1; the probability of a URL going down is close to 1. So, RAID works in most cases; CURIE URLs do not (ultimately) work in most cases. Disappearing CSS is not a problem for HTML because CSS does not affect the meaning of the page. Disappearing scripts are a problem for HTML but they are not a problem for HTML *data*. In other words, script-generated content is not guaranteed to survive, and there is nothing we can do about that except for a warning. Such content cannot be HTML-validated either. In general, scripts are best used (and intended) for behavior, not for creating content. External SVG files do not describe existing content, they *are* (embedded) content. If an HTML file disappears, it becomes unreadable as well, but that problem obviously cannot be solved from within HTML :-) "HTML should be readable 1000 years from now" was an attempt to visualize the intention of persistence. It should not be understood as "best before", of course. If the author chooses to create a redirect to a well-known vocabulary, using a dependent vocabulary stored at his own site in order to prevent link rot, tools that recognize vocabulary URLs without reading the corresponding resources will be unable to recognize the author's intent, and for the tools that do read it, the original vocabulary will still be unavailable, so this method causes more problems than it solves. Cheers, Chris
Re: [whatwg] Link rot is not dangerous
Kristof Zelechovski wrote: Classes in com.sun.* are reserved for Java implementation details and should not be used by the general public. CURIE URLs are intended for general use. So, I can say, "Well, it is not the same," because it is not. Cheers, Chris But we're not dealing with Java anymore. We're dealing with using reversed DNS concatenated with some kind of default URI, to create some kind of bastardized URL, which actually is valid, though incredibly painful to see, and can be implied to actually take one to a web address. You don't have to take my word for it -- check out Philip's testing demo for microdata. You get triples with the following: http://www.w3.org/1999/xhtml/custom#com.damowmow.cat http://philip.html5.org/demos/microdata/demo.html#output_ntriples Not only do you face problems with link rot, you also face a significant amount of confusion, as people look at that and go, "What the hell is that?" Oh, and you can say, "Well, but we don't _mean_ anything by it" -- but what does that have to do with anything? People don't go running to the spec every time they see something. They look at this thing and think, "Oh, a link. I wonder where it goes." You go ahead and try it, and imagine for a moment the confusion when it goes absolutely nowhere. Except that I imagine the W3C folks are getting a little annoyed with the HTML WG now, for allowing this type of thing in, generating a whole bunch of 404 errors for the web master(s). But hey, you've given me another idea. I think I'll create my own vocabulary items, with the reversed DNS http://www.w3.org/1999/xhtml/custom#com.sun.*. No, maybe http://www.w3.org/1999/xhtml/custom#com.opera.*. Nah, how about http://www.w3.org/1999/xhtml/custom#com.microsoft.*. Yeah, that's cool. And there is no mechanism in place to prevent this, because unlike regular URIs, where the domain is actually controlled by a specific entity, you've created the world famous W3C fudge pot. Anything goes. I can't wait for the lawsuits on this one. 
You think that cybersquatting is an issue on the web, or Facebook, or Twitter? Wait until you see people use com.microsoft.*. Then there's the vocabulary that was created by foobar.com, that people think, "Hey, cool, I'll use that... whatever it is." After all, if you want to play with the RDF kids, your vocabularies have to be usable by other people. But Foobar takes a dive in the dot-com pool, and foobar.com gets taken over by a porn establishment. Yeah, I can't wait for people to explain that one to the boss. Just because it doesn't link won't mean it won't end up on Twitter as a big, huge joke. If you want to find something to criticize, I think it's important to realize that hey, folks, you've just stepped over the line, and you're now in the Zone of Decentralization. Whatever impacts us, babes, impacts all of you. Because if you look at Philip's example, you're going to see the same set of vocabulary URIs we're using for RDF right now, as microdata uses our stuff, too. Including the links that are all trembling on the edge of self-implosion. So the point of all of this is moot. But it was fun. Really fun. Have a great weekend. Shelley
Re: [whatwg] Link rot is not dangerous
On Fri, May 15, 2009 at 6:25 PM, Shelley Powers shell...@burningbird.net wrote: The most important point to take from all of this, though, is that link rot within the RDF world is an extremely rare and unlikely occurrence. That seems to be untrue in practice - see http://philip.html5.org/data/rdf-namespace-status.txt The source data is the list of common RDF namespace URIs at http://ebiquity.umbc.edu/resource/html/id/196/Most-common-RDF-namespaces from three years ago. Out of those 284: * 56 are 404s. (Of those, 37 end with '#', so that URI itself really ought to exist. In the other cases, it'd be possible that only the prefix+suffix URIs are meant to exist. Some of the cases are just typos, but I'm not sure how many.) * 2 are Forbidden. (Of those, 1 looks like a typo.) * 2 are Bad Gateway. * 22 could not connect to the server. (Of those, 2 weren't http:// URIs, and 1 was a typo. The others represent 13 different domains.) (For the URIs which returned Redirect responses, I didn't check what happens when you request the URI it redirected to, so there may be more failures.) Over a quarter of the most common namespace URIs don't resolve successfully today, and most of those look like they should have resolved when they were originally used, so link rot seems to be common. (Major vocabularies like RSS and FOAF are likely to exist for a long time, but they're the easiest cases to handle - we could just pre-define the prefixes rss: and foaf: and have a centralised database mapping them onto schemas/documentation/etc. It seems to me that URIs are most valuable to let any tiny group make one for their rarely-used vocabulary, and be guaranteed no name collisions without needing to communicate with a centralised registry to ensure uniqueness; but it's those cases that are most vulnerable to link rot, and in practice the links appear to fail quite often.) 
(I'm not arguing that link rot is dangerous - just that the numbers indicate it's a common situation rather than an extremely rare exception.) -- Philip Taylor exc...@gmail.com
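The kind of survey Philip ran can be sketched in a few lines. Here the HTTP responses are canned, invented examples standing in for live requests, so what's shown is the tallying logic rather than actual network access:

```python
# Tally namespace-URI health the way Philip's survey does, but against
# invented example results rather than live HTTP requests.
def tally(results):
    """results: dict mapping URI -> HTTP status code, or None if the
    connection failed. Returns counts per failure category."""
    counts = {"ok": 0, "404": 0, "forbidden": 0, "no_connect": 0, "other": 0}
    for uri, status in results.items():
        if status is None:
            counts["no_connect"] += 1
        elif 200 <= status < 400:
            counts["ok"] += 1
        elif status == 404:
            counts["404"] += 1
        elif status == 403:
            counts["forbidden"] += 1
        else:
            counts["other"] += 1
    return counts

sample = {
    "http://xmlns.com/foaf/0.1/": 200,
    "http://purl.org/rss/1.0/": 200,
    "http://example.org/dead-vocab#": 404,
    "http://gone.example.net/ns#": None,
}
print(tally(sample))
```

As Philip notes, a real run would also need to follow redirects before classifying a URI as healthy.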
Re: [whatwg] Link rot is not dangerous
On Fri, May 15, 2009 at 1:32 PM, Manu Sporny mspo...@digitalbazaar.com wrote: Tab Atkins Jr. wrote: Reversed domains aren't *meant* to link to anything. They shouldn't be parsed at all. They're a uniquifier so that multiple vocabularies can use the same terms without clashing or ambiguity. The Microdata proposal also allows normal urls, but they are similarly nothing more than a uniquifier. CURIEs, at least theoretically, *rely* on the prefix lookup. After all, how else can you tell that a given relation is really the same as, say, foaf:name? If the domain isn't available, the data will be parsed incorrectly. That's why link rot is an issue. Where in the CURIE spec does it state or imply that if a domain isn't available, that the resulting parsed data will be invalid? Assume a page that uses both foaf and another vocab that subclasses many foaf properties. Given working lookups for both, the rdf parser can determine that two entries with different properties are really 'the same', and hopefully act on that knowledge. If the second vocab 404s, that information is lost. The parser will then treat any use of that second vocab completely separately from the foaf, losing valuable semantic information. (Please correct any misunderstandings I may be operating under; I'm not sure how competent parsers currently are, and thus how much they'd actually use a working subclassed relation.) ~TJ
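Tab's scenario - the "same as" relationship silently disappearing when the second vocabulary 404s - can be made concrete with a toy inference step. The ex: vocabulary and its subPropertyOf declaration below are invented for illustration:

```python
# A page uses an invented ex: vocabulary whose schema declares
# ex:fullName rdfs:subPropertyOf foaf:name. If the schema can be
# fetched, a parser can infer the foaf:name triple; if it 404s, that
# inference (and the link to foaf) is lost without any error.
FOAF = "http://xmlns.com/foaf/0.1/"
EX = "http://vocab.example.org/ns#"  # hypothetical second vocabulary

page_triples = {("#me", EX + "fullName", "Alice")}

# What the (hypothetical) ex: schema declares, if we managed to fetch it:
schema_subproperties = {EX + "fullName": FOAF + "name"}

def infer(triples, subproperties):
    inferred = set(triples)
    for s, p, o in triples:
        if p in subproperties:
            inferred.add((s, subproperties[p], o))
    return inferred

with_schema = infer(page_triples, schema_subproperties)
without_schema = infer(page_triples, {})  # schema 404ed: nothing to apply

print(("#me", FOAF + "name", "Alice") in with_schema)     # True
print(("#me", FOAF + "name", "Alice") in without_schema)  # False
```

Note that in both cases the page's own triples parse identically - which is the crux of Manu's counter-question above: what is lost is extra entailed information, not the validity of the parsed data.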
Re: [whatwg] Link rot is not dangerous
Philip Taylor wrote: On Fri, May 15, 2009 at 6:25 PM, Shelley Powers shell...@burningbird.net wrote: The most important point to take from all of this, though, is that link rot within the RDF world is an extremely rare and unlikely occurrence. That seems to be untrue in practice - see http://philip.html5.org/data/rdf-namespace-status.txt The source data is the list of common RDF namespace URIs at http://ebiquity.umbc.edu/resource/html/id/196/Most-common-RDF-namespaces from three years ago. Out of those 284: * 56 are 404s. (Of those, 37 end with '#', so that URI itself really ought to exist. In the other cases, it'd be possible that only the prefix+suffix URIs are meant to exist. Some of the cases are just typos, but I'm not sure how many.) * 2 are Forbidden. (Of those, 1 looks like a typo.) * 2 are Bad Gateway. * 22 could not connect to the server. (Of those, 2 weren't http:// URIs, and 1 was a typo. The others represent 13 different domains.) (For the URIs which returned Redirect responses, I didn't check what happens when you request the URI it redirected to, so there may be more failures.) Over a quarter of the most common namespace URIs don't resolve successfully today, and most of those look like they should have resolved when they were originally used, so link rot seems to be common. (Major vocabularies like RSS and FOAF are likely to exist for a long time, but they're the easiest cases to handle - we could just pre-define the prefixes rss: and foaf: and have a centralised database mapping them onto schemas/documentation/etc. It seems to me that URIs are most valuable to let any tiny group make one for their rarely-used vocabulary, and be guaranteed no name collisions without needing to communicate with a centralised registry to ensure uniqueness; but it's those cases that are most vulnerable to link rot, and in practice the links appear to fail quite often.) 
(I'm not arguing that link rot is dangerous - just that the numbers indicate it's a common situation rather than an extremely rare exception.) Philip, I don't think the occurrence of link rot causing problems in the RDF world is all that common, but thanks for looking up this data. Actually, I will probably quote your info in my next post at my weblog. I'd like to be dropped from any additional emails in this thread. After all, I have it on good authority that I'm not open to rational discussion. So I'll leave this type of thing to you guys. Thanks Shelley
Re: [whatwg] Link rot is not dangerous
The problem of cybersquatting of oblique domains is, I believe, described and addressed in the tag URI scheme definition [RFC4151], which I think is something rather similar to the constructs used for HTML microdata. I think that document is relevant not only to this discussion but to the whole concept. IMHO, Chris
Re: [whatwg] Annotating structured data that HTML has no semantics for
On Wed, May 13, 2009 at 10:04 AM, Leif Halvard Silli l...@malform.no wrote: Toby Inkster on Wed May 13 02:19:17 PDT 2009: Leif Halvard Silli wrote: Hear hear. Let's call it Cascading RDF Sheets. http://buzzword.org.uk/2008/rdf-ease/spec http://buzzword.org.uk/2008/rdf-ease/reactions I have actually implemented it. It works. Oh! Thanks for sharing. Indeed, RDF-EASE seems fairly nice! RDFa is better though. What does 'better' mean in this context? Why and how? Because it is easier to process? But EASE seems more compatible with microformats, and is better in that sense. I'd also like clarification here. I dislike *all* of the inline metadata proposals to some degree, for the same reasons that I dislike inline @style and @onfoo handlers. A Selector-based way of applying semantics fits my theoretical needs much better. I read all the reactions you pointed to. Some made the claim that EASE would move semantics out of the HTML file, and that microformats was better as it keeps the semantics inside the file. But I of course agree with you that EASE just underlines/outlines the semantics already in the file. Yup. The appropriate critique of separated metadata is that the *data* is moved out of the document, where it will inevitably decay compared to the live document. RDF-EASE keeps all the data stored in the live document, and merely specifies how to extract it. The only way you can lose data then is by changing the HTML structure itself, which is much less common than just changing the content. From the EASE draft: All properties in RDF-EASE begin with the string -rdf-, as per §4.1.2.1 Vendor-specific extensions in [CSS21]. This allows RDF-EASE and CSS to be safely mixed in one file, [...] I wonder why you think it is so important to be able to mix CSS and EASE. It seems better to separate the two completely. I'm not thrilled with the mixture of CSS and metadata either. Just because it uses Selectors doesn't mean it needs to be specifiable alongside CSS. 
jQuery uses Selectors too, but it stays where it belongs. ^_^ (That being said, there's a plugin for it that allows you to specify js in your CSS, and it gets applied to the matching elements from the block's selector.) ~TJ
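The Selector-based approach Tab prefers can be illustrated with a toy model - this is not RDF-EASE's actual syntax or processing model; the simplified rules, element records, and property URIs below are invented. The idea shown is just that semantics attach to elements *matched by selectors* rather than being written inline:

```python
# Toy model of selector-driven semantics: rules map a (very simplified)
# selector to an RDF property, and extraction pulls the text of matching
# elements. Real RDF-EASE is far richer than this sketch.
elements = [
    {"tag": "h1", "class": "title", "text": "Link rot is not dangerous"},
    {"tag": "span", "class": "author", "text": "Shelley Powers"},
]

rules = {
    ".title": "http://purl.org/dc/terms/title",
    ".author": "http://purl.org/dc/terms/creator",
}

def matches(selector, el):
    # This sketch supports only bare class selectors like ".title".
    return selector.startswith(".") and el["class"] == selector[1:]

def extract(elements, rules):
    triples = []
    for selector, prop in rules.items():
        for el in elements:
            if matches(selector, el):
                triples.append(("<doc>", prop, el["text"]))
    return triples
```

As with RDF-EASE itself, the data lives in the document; only the extraction rules are external - so a rewrite of the markup structure, not of the content, is what breaks the mapping.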