Re: [whatwg] [Fwd: Re: Helping people searching for content filtered by license]

2009-05-15 Thread Smylers
Nils Dagsson Moskopp writes:

 On Friday, 2009-05-08, 19:57 +, Ian Hickson wrote:
 
   * Tara runs a video sharing web site for people who want
 licensing information to be included with their videos. When
 Paul wants to blog about a video, he can paste a fragment of
 HTML provided by Tara directly into his blog. The video is
 then available inline in his blog, along with any licensing
 information about the video.
  
  This can be done with HTML5 today. For example, here is the markup you 
  could include to allow someone to embed a video on their site while 
  including the copyright or license information:
  
<figure>
 <video src="http://example.com/videodata/sJf-ulirNRk" controls>
  <a href="http://video.example.com/watch?v=sJf-ulirNRk">Watch</a>
 </video>
 <legend>
  Pillar post surgery, starting to heal.
  <small>&copy; copyright 2008 Pillar. All Rights Reserved.</small>
 </legend>
</figure>
 
 Seriously, I don't get it. Is there really so much entrenched (widely
 deployed, a mess, IE-style) software out there relying on @rel=license
 meaning "license of a single main content blob"

Merely using rel=license in the above example would not cause the
copyright message to be displayed to users.
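For reference, a minimal sketch of the rel=license pattern under
discussion (the license URL is the real Creative Commons one used later
in the thread; the surrounding markup is illustrative only):

  <small>
   <a rel="license"
      href="http://creativecommons.org/licenses/by-nc-sa/2.5/legalcode">
    Creative Commons Attribution-Noncommercial-Share Alike 2.5
   </a>
  </small>

rel="license" is an annotation for machine consumers; it changes nothing
about how the link renders, which is the point above.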

 that an unambiguous (read: machine-readable) writeup of part licenses
 is impossible?

Why does the license information need to be machine-readable in this
case?  (It may need to be for a different scenario, but that would be
dealt with separately.)

  The example above shows this for a movie, but it works as well for a
  photo:
  
<figure>
  <img src="http://nearimpossible.com/DSCF0070-1-tm.jpg" alt="">
  <legend>
    Picture by Bob.
    <small><a href="http://creativecommons.org/licenses/by-nc-sa/2.5/legalcode">Creative
    Commons Attribution-Noncommercial-Share Alike 2.5 Generic
    License</a></small>
  </legend>
</figure>
 
 Can I infer from this that an <a> in a <small> inside a <legend> is
 some kind of microformat for licensing information?

No.  But if a human sees a string that mentions "©", "copyright", or
"license" then she's likely to realize it's licensing information.  And
if it's placed next to a picture it's conventional to interpret it as
applying to that picture.  It's also conventional for such information to
be small, because it's usually not the main content the user is
interested in when choosing to view the page.

Magazines and the like have been using this convention for years,
without any need to explicitly define what indicates licensing
information, seemingly without any ambiguity or confusion.

Smylers


Re: [whatwg] [Fwd: Re: Helping people searching for content filtered by license]

2009-05-15 Thread Eduard Pascual
On Fri, May 15, 2009 at 8:40 AM, Smylers smyl...@stripey.com wrote:
 Nils Dagsson Moskopp writes:

 On Friday, 2009-05-08, 19:57 +, Ian Hickson wrote:

       * Tara runs a video sharing web site for people who want
         licensing information to be included with their videos. When
         Paul wants to blog about a video, he can paste a fragment of
         HTML provided by Tara directly into his blog. The video is
         then available inline in his blog, along with any licensing
         information about the video.
 [...]

 Why does the license information need to be machine-readable in this
 case?  (It may need to be for a different scenario, but that would be
 dealt with separately.)

It would need to be machine-readable for tools like
http://search.creativecommons.org/ to do their job: check the license
against the engine's built-in knowledge of some licenses, and figure
out whether it is suitable for the usages the user has requested (like
"search for content I can build upon" or "search for content I can use
commercially"). Ideally, it should be enough for a search engine to
find the video on either Tara's site *or* Paul's blog for it to be
available to users.
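To make that concrete, here is a sketch of the sort of markup such a
tool can consume, reusing the photo example from earlier in the thread
(whether rel=license may mark a single figure rather than the whole
page is exactly what is in dispute):

  <figure>
    <img src="http://nearimpossible.com/DSCF0070-1-tm.jpg" alt="">
    <legend>
      Picture by Bob.
      <small><a rel="license"
       href="http://creativecommons.org/licenses/by-nc-sa/2.5/legalcode">CC
       BY-NC-SA 2.5</a></small>
    </legend>
  </figure>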

Just my two cents.


Re: [whatwg] [Fwd: Re: Helping people searching for content filtered by license]

2009-05-15 Thread Smylers
Eduard Pascual writes:

 On Fri, May 15, 2009 at 8:40 AM, Smylers smyl...@stripey.com wrote:
 
  On Friday, 2009-05-08, 19:57 +, Ian Hickson wrote:
  
 * Tara runs a video sharing web site for people who want
   licensing information to be included with their videos.
   When Paul wants to blog about a video, he can paste a
   fragment of HTML provided by Tara directly into his blog.
   The video is then available inline in his blog, along
   with any licensing information about the video.
  
  Why does the license information need to be machine-readable in this
  case?  (It may need to be for a different scenario, but that would be
  dealt with separately.)
 
 It would need to be machine-readable for tools like
 http://search.creativecommons.org/ to do their job: check the license
 against the engine's built-in knowledge of some licenses, and figure
 out whether it is suitable for the usages the user has requested (like
 "search for content I can build upon" or "search for content I can use
 commercially"). Ideally, it should be enough for a search engine to
 find the video on either Tara's site *or* Paul's blog for it to be
 available to users.

Yeah, that sounds plausible.  However, that's what I meant by "a
different scenario" -- adding criteria to the above, specifically about
searching.  Hixie attempted to address this case too:

Admittedly, if this scenario is taken in the context of the
first scenario, meaning that Bob wants this image to be
discoverable through search, but doesn't want to include it on a
page of its own, then extra syntax to mark this particular image
up would be useful.
   
However, in my research I found very few such cases. In every
case where I found multiple media items on a single page with no
dedicated page, either every item was licensed identically and
was the main content of the page, or each item had its own
separate page, or the items were licensed under the same license
as the page. In all three of these cases, rel=license already
solves the problem today.

To which Nils responded:

   Relying on linked pages just to get licensing information would
   be, well, massive overhead. Still, you are right - most blogs
   using many pictures have dedicated pages.

It's perfectly valid to disagree with this being sufficient (I
personally have no view either way on the matter).  I was just
clarifying that the <legend> mark-up example wasn't attempting to address
this case, and wasn't proposing <legend><small> (or whatever) as a
machine-readable microformat.

Smylers


Re: [whatwg] Annotating structured data that HTML has no semantics for

2009-05-15 Thread Eduard Pascual
On Thu, May 14, 2009 at 10:17 PM, Maciej Stachowiak m...@apple.com wrote:
 [...]
 From my cursory study, I think microdata could subsume many of the use cases
 of both microformats and RDFa.
Maybe. But microformats and RDFa can handle *all* of these cases.
Again, what are the benefits of creating something entirely new to
replace what already exists while it can't even handle all the cases
of what it is replacing? Both the new syntax and the case
restrictions are costs: what are these costs buying? If it's not
clear what we are getting for these costs, it is impossible to
evaluate whether the costs are worth it or not.

 It seems to me that it avoids much of what microformats advocates find 
 objectionable
Could you be more specific, please? Do you mean anything other than
WHATWG's almost irrational hate toward CURIEs and everything that
involves prefixes?

 but at the same time it seems it can represent a full RDF data
 model.
No, it *can't* represent a full RDF model: this has already been shown
several times in this thread.

 Thus, I think we have the potential to get one solution that works for 
 everyone.
RDFa itself doesn't work for everyone; but microdata is even more
restricted: it leaves out the cases that RDFa leaves out, but it also
leaves out some cases that RDFa was able to handle. So, where do you
see such potential?

 I'm not 100% sure microdata can really achieve this, but I think making the
 attempt is a positive step.
What do you mean by "making the attempt"? If there is something
microdata can't handle, it won't be able to handle it without changing
the spec. If you meant that evolving the microdata proposal towards
something that works for everyone is a positive step, then I agree;
but if you meant engraving this microdata approach into the spec,
setting it in stone, and then attempting to get everyone to accept it,
then I definitely disagree. So, please, could you clarify the meaning
of that statement? Thanks.

 One other detail that it seems not many people have picked up on yet is that
 microdata proposes a DOM API to extract microdata-based info from a live
 document on the client side. In my opinion this is huge and has the
 potential to greatly increase author interest in semantic markup.
All right, an API may be a benefit. Most probably it is. However, a
similar API could have been built on RDFa, or eRDF, or EASE, or any
other already existing or new solution; so it doesn't justify creating
a new syntax. I have to insist: what are the benefits of such a
built-from-the-ground-up, restrictive *syntax*? That's what we need to
know to evaluate it against its costs.
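For concreteness, the kind of client-side extraction being discussed
would look something like this sketch. The attribute names and the
getItems() API follow the microdata proposal as I understand it, and
every name here should be treated as provisional:

  <div item="org.example.person">
   <span itemprop="org.example.person.name">Anne</span>
  </div>
  <script>
   // Provisional API from the proposal: getItems() is meant to
   // return the document's top-level items, each exposing its
   // properties to client-side scripts without an external library.
   var people = document.getItems("org.example.person");
   alert(people.length);
  </script>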

 Now, it may be that microdata will ultimately fail, either because it is
 outcompeted by RDFa, or because not enough people care about semantic
 markup, or whatever. But at least for now, I don't see a reason to strangle
 it in the cradle.
At least for now, I don't see a reason why it was created to begin
with. Maybe if somebody could enlighten us on this detail, this
discussion could evolve into something more useful and productive.

On Fri, May 15, 2009 at 6:53 AM, Maciej Stachowiak m...@apple.com wrote:

 On May 14, 2009, at 1:30 PM, Shelley Powers wrote:

 So, if I'm pushing for RDFa, it's not because I want to win. It's
 because I have things I want to do now, and I would like to make sure they
 have a reasonable chance of working a couple of years in the future. And yeah, once
 SVG is in HTML5, and RDFa can work with HTML5, maybe I wouldn't mind giving
 old HTML a try again. Lord knows I'd like to use ampersands again.

 It sounds like your argument comes down to this: you have personally
 invested in RDFa, therefore having a competing technology is bad, regardless
 of the technical merits.
Pause, please. Before going on, I need to ask again: what are those
technical merits?

 I don't mean to parody here - I am somewhat sympathetic to this line of 
 argument.
I think I'm interpreting Shelley's argument slightly differently. She
didn't choose RDFa because it was better than microdata. She chose RDFa
because it was better than other options, and microdata didn't even
exist yet. Now microdata comes out, some drawbacks are highlighted in
comparison with RDFa (lack of typing, inability to depict the full RDF
model, and reversed domains as ugly as CURIEs, though at least CURIEs
resolve to something useful, while reversed domains often don't
resolve at all), and you ask RDFa proponents to give microdata a
chance, to not "strangle it in the cradle"; but nobody seems willing
to answer the one question: what does microdata provide to make up for
its drawbacks?

 Often pragmatic concerns mean that an incremental improvement just isn't 
 worth the cost of switching
Wait. Are you referring to microdata as an incremental improvement over
RDFa? IMO, it's rather a decremental enworsement.

 My personal judgment is that we're not past the point of
 no return on data embedding. There's microformats, RDFa, and then dozens of
 other serializations of RDF [...]

Re: [whatwg] Annotating structured data that HTML has no semantics for

2009-05-15 Thread Kristof Zelechovski
I do not think anybody in WHATWG hates the CURIE tool; however, the
following problems have been put forward:

Copy-Paste
The CURIE mechanism is considered inconvenient because it is not
copy-paste-resilient, and the associated risk is that semantic elements
would randomly change their meaning.
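A sketch of the failure mode (the cc: mapping is the real ccREL
namespace; the rest is illustrative):

  <!-- Original page: the prefix is declared on an ancestor element -->
  <div xmlns:cc="http://creativecommons.org/ns#">
   ...
   <a rel="cc:license" href="http://example.org/license">licensed</a>
   ...
  </div>

Pasting only the inner <a> into another document silently drops the
xmlns:cc declaration, so cc:license no longer resolves to anything.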

Link rot
CURIE definitions can only be looked up while the CURIE server is
providing them; the chance of the URL becoming broken is high for
home-brewed vocabularies.  While the vocabularies can be moved elsewhere, it
will not always be possible to create a redirect.

Chris





Re: [whatwg] Annotating structured data that HTML has no semantics for

2009-05-15 Thread Shelley Powers

Maciej Stachowiak wrote:


On May 14, 2009, at 1:30 PM, Shelley Powers wrote:

So, if I'm pushing for RDFa, it's not because I want to win. It's 
because I have things I want to do now, and I would like to make sure 
they have a reasonable chance of working a couple of years in the future. 
And yeah, once SVG is in HTML5, and RDFa can work with HTML5, maybe I 
wouldn't mind giving old HTML a try again. Lord knows I'd like to 
use ampersands again.


It sounds like your argument comes down to this: you have personally 
invested in RDFa, therefore having a competing technology is bad, 
regardless of the technical merits. I don't mean to parody here - I am 
somewhat sympathetic to this line of argument. Often pragmatic 
concerns mean that an incremental improvement just isn't worth the 
cost of switching (for example HTML vs. XHTML). My personal judgment 
is that we're not past the point of no return on data embedding. 
There's microformats, RDFa, and then dozens of other serializations of 
RDF (some of which you cited). This doesn't seem like a space on the 
verge of picking a single winner, and the players seem willing to 
experiment with different options.



There are not dozens of other serializations of RDF.

The point I was trying to make is, I'd rather put my time into something 
that exists now, than have to watch the wheel re-invented. I'd rather 
see semantic metadata become a reality. I'm glad that you personally 
feel that companies will be just peachy keen on having to support 
multiple parsers to get the same data.


On the HTML WG side, I will never support microdata, because no case has 
been made for its existence.




The point is, people in the real world have to use this stuff. It 
helps them if they have one generally agreed-on approach. As it 
is, folks have to contend with both RDFa and microformats, but at 
least we know these have different purposes.


From my cursory study, I think microdata could subsume many of the 
use cases of both microformats and RDFa. It seems to me that it 
avoids much of what microformats advocates find objectionable, and 
provides a good basis for new microformats; but at the same time it 
seems it can represent a full RDF data model. Thus, I think we have 
the potential to get one solution that works for everyone.


I'm not 100% sure microdata can really achieve this, but I think 
making the attempt is a positive step.



It can't, don't you see?

Microdata will only work in HTML5/XHTML5. XHTML 1.1 and yes, 2.0 will 
be around for years, decades. In addition, XHTML5 already supports RDFa.


Supporting XHTML 1.1 has about 0.001% as much value as 
supporting text/html. XHTML 2.0 is completely irrelevant to the Web, 
and looks on track to remain so. So I don't find this point very 
persuasive.


I don't think you'll find that the world is breathlessly waiting for 
HTML5. I think you'll find that XHTML 1.1 will have wider use than HTML5 
for the next decade. If not longer. I wouldn't count out XHTML 2.0, 
either.  And in a decade, a lot can change.


Why you think something completely brand new, with no vendor support, 
drummed up in a few hours or a day or so, is more robust and a better 
option than a mature spec in wide use frankly boggles my mind.


I haven't evaluated it enough to know for sure (as I said). I do think 
avoiding CURIEs is extremely valuable from the point of view of sane 
text/html semantics and ease of authoring; and RDF experts seem to 
think it works fine for representing RDF data models. So tentatively, 
I don't see any gaping holes. If you see a technical problem, and not 
just potential competition for the technology you've invested in, then 
you should definitely cite it.


I don't think CURIEs are that difficult, nor impossible, no matter the 
arguments that Henri brings out.


I am impressed with your belief in HTML5.

But
One other detail that it seems not many people have picked up on yet 
is that microdata proposes a DOM API to extract microdata-based info 
from a live document on the client side. In my opinion this is huge 
and has the potential to greatly increase author interest in 
semantic markup.




Not really. You can do this now with RDFa in XHTML. And I don't need 
any new DOM API to do it.


The power of semantic markup isn't really seen until you take that 
markup data _outside_ the document. And merge that data with data 
from other documents. Google rich snippets. Yahoo searchmonkey. Heck, 
even an application that manages the data from different subsites of 
one domain.


I respectfully disagree. An API to do things client-side that doesn't 
require an external library is extremely powerful, because it lets 
content authors easily make use of the very same semantic markup that 
they are vending for third parties, so they have more incentive to use 
it and get it right.



Sure, we'll have to disagree on this one.


Now, it may be that microdata will ultimately fail, either because 
it is outcompeted by RDFa, 

Re: [whatwg] Annotating structured data that HTML has no semantics for

2009-05-15 Thread Dan Brickley

On 15/5/09 14:11, Shelley Powers wrote:

Kristof Zelechovski wrote:

I do not think anybody in WHATWG hates the CURIE tool; however, the
following problems have been put forward:

Copy-Paste
The CURIE mechanism is considered inconvenient because it is not
copy-paste-resilient, and the associated risk is that semantic elements
would randomly change their meaning.


Well, no, the elements won't randomly change their meaning. The only
risk is copying and pasting them into a document that doesn't provide
namespace definitions for the prefixes. Are you thinking that someone
will be using different namespaces but the same prefix? Come on -- do
you really think that will happen?


The most likely case is with Dublin Core, but DC data varies enough 
already that this isn't too destructive...


Dan


Re: [whatwg] Annotating structured data that HTML has no semantics for

2009-05-15 Thread Eduard Pascual
On Fri, May 15, 2009 at 1:44 PM, Kristof Zelechovski
giecr...@stegny.2a.pl wrote:
 I do not think anybody in WHATWG hates the CURIE tool; however, the
 following problems have been put forward:

 Copy-Paste
        The CURIE mechanism is considered inconvenient because it is not
 copy-paste-resilient, and the associated risk is that semantic elements
 would randomly change their meaning.
Copy-paste issues with RDFa and similar syntaxes can take two forms.
The first is orphaned prefixes: metadata with a given prefix is
copied, but then pasted in a context where the prefix is not
defined. If the user who is copy-pasting this stuff really cares
about metadata, s/he will review the code and make the relevant fixes
and/or copy the prefix declarations, the same way an author who is
copy-pasting content and wants to preserve formatting will copy
the CSS stuff. If the user doesn't actually care about the metadata,
then there is no harm, because properties relying on an unmapped
prefix should yield no RDF output at all.
The second form is prefix clashes: this is actually extremely rare.
For example, someone copies code with FOAF metadata, and then pastes
it on another page: what are the chances that user will be using a
foaf: prefix for something other than FOAF? Sure, there are cases where
a clash might happen but, again, these are only likely to appear on
pages by authors who have some idea about metadata, and hence the
author is more than likely to review the code being pasted to prevent
these and other clashes (such as classes that would mean something
completely different under the new page's CSS code, element id
clashes, etc). A last possibility is that the author doesn't have any
idea about metadata at all, but is using a CMS that relies on
metadata. In such a case, it would be wise on the CMS's part to
pre-process code fragments and either map the prefix to what it means
(if it's obvious) or remove the invalid data (if the CMS can't figure
out what it should mean).


 Link rot
        CURIE definitions can only be looked up while the CURIE server is
 providing them; the chance of the URL becoming broken is high for
 home-brewed vocabularies.  While the vocabularies can be moved elsewhere, it
 will not always be possible to create a redirect.

Oh, and do reversed domains help at all with this? OK, with CURIEs
there is a (relatively small) chance for the CURIE to not be
resolvable at a given time; reversed domains have a 100% chance to not
be resolvable at any time: there is always, at least, ambiguity: does
org.example.foo map to foo.example.org, example.org/foo, or
example.org#foo? Even better: what if, under example.org, we find a
vocabulary at example.org/foo and another at foo.example.org? (OK,
that'd be quite unwise, although it might be a legitimate way to keep
deployed and test versions of a vocabulary online at the same time; but
anyway CURIEs can cope with it, while reversed domains can't.)
Wherever there are links, there is a chance for broken links: that's
part of the nature of links, and the evolving nature of the web. But
just because links can break, would you deny the utility of elements
such as <a> and <link>? Reversed domains don't face broken links
because they are simply incapable of linking to anything.
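To make the contrast concrete, a sketch (all example.org names are
hypothetical):

  <!-- RDFa: the prefix maps to a URI that can, in principle, be
       dereferenced -->
  <div xmlns:foo="http://example.org/foo#">
   <span property="foo:title">...</span>
  </div>

  <!-- microdata-style reversed domain: a bare uniquifier with
       nothing to dereference -->
  <span itemprop="org.example.foo.title">...</span>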


Now, please, I'd appreciate it if you reviewed your arguments before
posting them: while the copy-paste issue is a legitimate argument, and
now we can consider whether this copy-paste-resilience is worth the
costs of microdata, that "link rot" argument is just a waste of
everyone's time, including yours. Anyway, thanks for that first
argument: that's exactly what I was asking for in the hope of letting
this discussion advance somewhere.

So, before we start comparing benefits against costs, can someone post
any more benefits, or does the copy-paste-resilience point stand alone
against all the costs and possible issues?

Regards,
Eduard Pascual


Re: [whatwg] Annotating structured data that HTML has no semantics for

2009-05-15 Thread Simon Pieters
On Thu, 14 May 2009 22:30:41 +0200, Shelley Powers shell...@burningbird.net 
wrote:

 I'm not 100% sure microdata can really achieve this, but I think making  
 the attempt is a positive step.

 It can't, don't you see?

 Microdata will only work in HTML5/XHTML5.

Actually, as specified, it would work for any text/html and any XHTML content. 
It would only be *valid* in (X)HTML5, but it would work even if the input is not 
valid (X)HTML5 or looks like HTML4 or XHTML 1.1.


 XHTML 1.1 and yes, 2.0 will be  
 around for years, decades. In addition, XHTML5 already supports RDFa.

XHTML5 supports RDFa to the same extent that XHTML 1.1 supports microdata (in 
both cases, it would work but is not valid).

-- 
Simon Pieters
Opera Software


Re: [whatwg] Annotating structured data that HTML has no semantics for

2009-05-15 Thread Kristof Zelechovski
Links contribute to the behavior of the text contained within them,
not to its meaning, which does not depend on whether the link is
broken.  Moreover, whether the linked resource can be retrieved at all
depends on the URI scheme, as in href="mailto:u...@domain".
The advertised advantage of CURIE prefixes is that the metadata declaration
can be retrieved and looked up, and that can influence the meaning of the
text thus marked.  Therefore, link rot is a bigger problem for CURIE
prefixes than for links.

I think the original URL corresponding to a reversed domain prefix is
irrelevant, and attempts to reconstruct it are futile anyway.  Nonexistent
features are better than features that decay progressively, at least as far
as a specification is concerned.

Best regards,
Chris



[whatwg] Link rot is not dangerous (was: Re: Annotating structured data that HTML has no semantics for)

2009-05-15 Thread Manu Sporny
Kristof Zelechovski wrote:
 Therefore, link rot is a bigger problem for CURIE
 prefixes than for links.

There have been a number of people now that have gone to great lengths
to outline how awful link rot is for CURIEs and the semantic web in
general. This is a flawed conclusion, based on the assumption that there
must be a single vocabulary document in existence, for all time, at one
location. This has also lead to a false requirement that all
vocabularies should be centralized.

Here's the fear:

If a vocabulary document disappears for any reason, then the meaning of
the vocabulary is lost and all triples depending on the lost vocabulary
become useless.

That fear ignores the fact that we have a highly available document
store available to us (the Web). Not only that, but these vocabularies
will be cached (at Google, at Yahoo, at The Wayback Machine, etc.).

If a vocabulary document disappears (which is highly unlikely for
popular vocabularies; imagine FOAF disappearing overnight), then there
are alternative mechanisms to extract meaning from the triples that will
be left on the web.

Here are just two of the possible solutions to the problem outlined:

- The vocabulary is restored at another URL using a cached copy of the
vocabulary. The site owner of the original vocabulary either re-uses the
vocabulary, or re-directs the vocabulary page to another domain
(somebody that will ensure the vocabulary continues to be provided -
somebody like the W3C).
- RDFa parsers can be given an override list of legacy vocabularies that
will be loaded from disk (from a cached copy). If a cached copy of the
vocabulary cannot be found, it can be re-created from scratch if necessary.

The argument that link rot would cause massive damage to the semantic
web is just not true. Even if there is minor damage caused, it is fairly
easy to recover from it, as outlined above.

-- manu

-- 
Manu Sporny
President/CEO - Digital Bazaar, Inc.
blog: A Collaborative Distribution Model for Music
http://blog.digitalbazaar.com/2009/04/04/collaborative-music-model/



Re: [whatwg] Link rot is not dangerous

2009-05-15 Thread Dan Brickley

On 15/5/09 18:20, Manu Sporny wrote:

Kristof Zelechovski wrote:

Therefore, link rot is a bigger problem for CURIE
prefixes than for links.

[...]

The argument that link rot would cause massive damage to the semantic
web is just not true. Even if there is minor damage caused, it is fairly
easy to recover from it, as outlined above.


A few other points:

1. It's for the community of vocabulary-creators to help each other out 
w.r.t. hosting/publishing these: I just nudged a friend to put another 5 
years on the DNS rental for a popular namespace. I think we should put a 
bit more structure around these kinds of habits, so that popular 
namespaces won't drop off the Web by accident.


2. digitally signing the schemas will become part of the story, I'm 
sure. While it's a bit fiddly, there are advantages to having other 
mechanisms beyond URI de-referencing for knowing where a schema came from.


3. Parties worried about external dependencies when using namespaces can 
always indirect through their own namespace, whose schema document can 
declare subclass/subproperty relations to other URIs.
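For illustration, a sketch of that indirection in RDFa terms (the
example.org vocabulary is hypothetical; rdfs: and foaf: are the standard
RDF Schema and FOAF namespaces):

  <!-- A schema page on one's own domain declaring that a local term
       refines a foreign one, insulating documents from the foreign
       namespace -->
  <div xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
       about="http://example.org/myvocab#name"
       rel="rdfs:subPropertyOf"
       resource="http://xmlns.com/foaf/0.1/name"></div>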


cheers

Dan




Re: [whatwg] Link rot is not dangerous (was: Re: Annotating structured data that HTML has no semantics for)

2009-05-15 Thread Kristof Zelechovski
I understand that there are ways to recover resources that disappear from
the Web; however, the postulated advantage of RDFa, "you can go see what it
means", simply does not hold.  The recovery mechanism, Web search/cache,
would be as good for CURIE URLs as for domain prefixes.  Creating a redirect
is not always possible, and the built-in redirect dictionary (a CURIE catalog?)
smells of a central repository.  This is no better than public entity
identifiers in XML.

Serving the vocabulary from one's own domain is not always possible, e.g. in
the case of reader-contributed content, and it only guarantees that the
vocabulary will be alive while it is supported by the domain owner.  (WHATWG
wants HTML documents to be readable 1000 years from now.)  It is not always
practical either, as it could confuse URL-based tools that do not retrieve
the resources referenced.

All this does not imply, of course, that RDFa is no good.  It is only
intended to demonstrate that the postulated advantage of the CURIE lookup is
wishful thinking.

Best regards,
Chris



Re: [whatwg] Link rot is not dangerous

2009-05-15 Thread Shelley Powers

Dan Brickley wrote:

On 15/5/09 18:20, Manu Sporny wrote:

Kristof Zelechovski wrote:

Therefore, link rot is a bigger problem for CURIE
prefixes than for links.

[...]

The argument that link rot would cause massive damage to the semantic
web is just not true. Even if there is minor damage caused, it is fairly
easy to recover from it, as outlined above.


A few other points:

[...]




The most important point to take from all of this, though, is that link 
rot within the RDF world is an extremely rare and unlikely occurrence. 
I've been working with RDF for close to a decade, and link rot has never 
been an issue.


One of the very first uses of RDF, in RSS 1.0, for feeds, is still in 
existence, still viable. You don't have to take my word, check it out 
yourselves:


http://purl.org/rss/1.0/

Even if, and I want to strongly emphasize "if", link rot does occur, both 
Manu and Dan have demonstrated multiple ways of ensuring that no meaning 
is lost, and nothing is broken. However, I hope that people are open 
enough to take away from their discussions that they are trying to 
treat this concern respectfully, and trying to demonstrate that there's 
more than one solution. Not that this forms a proof of "Oh my god, 
if we use RDF, we're doomed!"


Also don't lose sight that this is really no more serious an issue than, 
say, a company originating com.sun.* being purchased by another 
company, named com.oracle.*.  And you can't say, "Well, that's not the 
same", because it is.


The only safe bet is to designate some central authority and give them 
power over every possible name. Then we run the massive risk of this 
system failing (and this applies to microdata's reverse DNS as well as 
RDF's URI), or it being taken over by an entity that sees such a data 
store as a way to make a great profit. We also defeat the very principle 
on which semantic data on the web abides, and that's true whether you 
support microdata or RDF.


Shelley






Re: [whatwg] Link rot is not dangerous

2009-05-15 Thread Kristof Zelechovski
Classes in com.sun.* are reserved for Java implementation details and should
not be used by the general public.  CURIE URLs are intended for general use.

So, I can say "Well, it is not the same", because it is not.

Cheers,
Chris



Re: [whatwg] Link rot is not dangerous

2009-05-15 Thread Manu Sporny
Kristof Zelechovski wrote:
 I understand that there are ways to recover resources that disappear from
 the Web; however, the postulated advantage of RDFa, "you can go see what it
 means", simply does not hold.

This is a strawman argument more below...

 All this does not imply, of course, that RDFa is no good.  It is only
 intended to demonstrate that the postulated advantage of the CURIE
 lookup is wishful thinking.

That train of logic seems to falsely conclude that if something does not
hold true 100% of the time, then it cannot be counted as an advantage.

Example:

Since the postulated advantage of RAID-5 is that a disk array is
unlikely to fail due to a single disk failure, and since it is possible
for more than one disk to fail before a recovery is complete, one cannot
call running a disk array in RAID-5 mode an advantage over not running
RAID at all (because failure is possible).

or

Since the postulated advantage of CURIEs is that "you can go see what it
means", and it is possible for a CURIE-defined URL to be unavailable, one
cannot call it an advantage because it may fail.

There are two flaws in the premises and reasoning above, for the CURIE case:

- It is assumed that for something to be called an 'advantage' that it
  must hold true 100% of the time.
- It is assumed that most proponents of RDFa believe that "you can go
  see what it means" holds at all times; one would have to be very
  deluded to believe that.

 The recovery mechanism, Web search/cache,
 would be as good for CURIE URL as for domain prefixes.  Creating a redirect
 is not always possible and the built-in redirect dictionary (CURIE catalog?)
 smells of a central repository. 

Why does having a file sitting on your local machine that lists
alternate vocabulary files for CURIEs smell of a central repository?
Perhaps you're assuming that the file would be managed by a single
entity? If so, it wouldn't need to be, and that was not what I was proposing.

 Serving the vocabulary from the own domain is not always possible, e.g. in
 case of reader-contributed content, 

This isn't clear, could you please clarify what you mean by
reader-contributed content?

 and only guarantees that the vocabulary
 will be alive while it is supported by the domain owner.

This case and its solution were already covered previously. Again, if
the domain owner disappears, the domain disappears, or the domain owner
doesn't want to cooperate for any reason, one could easily set up an
alternate URL and instruct the RDFa processor to re-direct any
discovered CURIEs that match the old vocabulary to the new
(referenceable) vocabulary.

 (WHATWG wants HTML documents to be readable 1000 years from now.)  

Is that really a requirement? What about external CSS files that
disappear? External Javascript files that disappear? External SVG files
that disappear? All those have something to do with the document's
human/machine readability. Why is HTML5 not susceptible to link rot in
the same way that RDFa is susceptible to link rot?

Also, why 1000 years? That seems a bit arbitrary. =P

 It is not always practical either as it could confuse URL-based 
 tools that do not retrieve the resources referenced.

Could you give an example of this that wouldn't be a bug in the
dereferencing application? How could a non-dereference-able URL confuse
URL-based tools?

-- manu

-- 
Manu Sporny
President/CEO - Digital Bazaar, Inc.
blog: A Collaborative Distribution Model for Music
http://blog.digitalbazaar.com/2009/04/04/collaborative-music-model/



Re: [whatwg] Annotating structured data that HTML has no semantics for

2009-05-15 Thread Tab Atkins Jr.
On Fri, May 15, 2009 at 9:17 AM, Eduard Pascual herenva...@gmail.com wrote:
 On Fri, May 15, 2009 at 1:44 PM, Kristof Zelechovski
 Link rot
        CURIE definitions can only be looked up while the CURIE server is
 providing them; the chance of the URL becoming broken is high for
 home-brewed vocabularies.  While the vocabularies can be moved elsewhere, it
 will not always be possible to create a redirect.

 Oh, and do reversed domains help at all with this? OK, with CURIEs
 there is a (relatively small) chance for the CURIE to not be
 resolvable at a given time; reversed domains have a 100% chance to not
 be resolvable at any time: there is always, at least, ambiguity: does
 org.example.foo map to foo.example.org, example.org/foo, or
 example.org#foo? Even better: what if, under example.org, we find a
 vocabulary at example.org/foo and another at foo.example.org? (OK,
 that'd be quite unwise, although it might be a legitimate way to keep
 deployed and test versions of a vocabulary online at the same time; but
 anyway CURIEs can cope with it, while reversed domains can't.)
 Wherever there are links, there is a chance for broken links: that's
 part of the nature of links, and the evolving nature of the web. But
 just because links can break, would you deny the utility of elements
 such as <a> and <link>? Reversed domains don't face broken links
 because they are simply incapable of linking to anything.

Reversed domains aren't *meant* to link to anything.  They shouldn't
be parsed at all.  They're a uniquifier so that multiple vocabularies
can use the same terms without clashing or ambiguity.  The Microdata
proposal also allows normal URLs, but they are similarly nothing more
than a uniquifier.

CURIEs, at least theoretically, *rely* on the prefix lookup.  After
all, how else can you tell that a given relation is really the same
as, say, foaf:name?  If the domain isn't available, the data will be
parsed incorrectly.  That's why link rot is an issue.
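Concretely, the binding is what establishes identity, not the prefix
string itself; a sketch (the FOAF namespace URI is the real one):

  <div xmlns:foaf="http://xmlns.com/foaf/0.1/">
   <span property="foaf:name">Alice</span>
  </div>
  <div xmlns:f="http://xmlns.com/foaf/0.1/">
   <span property="f:name">Alice</span>
  </div>

Both properties expand to http://xmlns.com/foaf/0.1/name, so a parser
treats them as identical; a consumer matching on the literal string
"foaf:" rather than on the expansion would get this wrong.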

~TJ


[whatwg] Link rot is not dangerous

2009-05-15 Thread Manu Sporny
Tab Atkins Jr. wrote:
 Reversed domains aren't *meant* to link to anything.  They shouldn't
 be parsed at all.  They're a uniquifier so that multiple vocabularies
 can use the same terms without clashing or ambiguity.  The Microdata
 proposal also allows normal URLs, but they are similarly nothing more
 than a uniquifier.
 
 CURIEs, at least theoretically, *rely* on the prefix lookup.  After
 all, how else can you tell that a given relation is really the same
 as, say, foaf:name?  If the domain isn't available, the data will be
 parsed incorrectly.  That's why link rot is an issue.

Where in the CURIE spec does it state or imply that if a domain isn't
available, the resulting parsed data will be invalid?

-- manu

-- 
Manu Sporny
President/CEO - Digital Bazaar, Inc.
blog: A Collaborative Distribution Model for Music
http://blog.digitalbazaar.com/2009/04/04/collaborative-music-model/



Re: [whatwg] Link rot is not dangerous

2009-05-15 Thread Kristof Zelechovski
Serving the RDFa vocabulary from one's own domain is not always possible, e.g.
when a reader of a Web site is encouraged to post a comment to the page she
reads and her comment contains semantic annotations.

The probability of a URL becoming unavailable is much greater than that of
both mirrored drives wearing out at the same time.  (Data mirroring does not
claim to protect from fire, water, high voltage, magnetic storms,
earthquakes and the like; it only protects you from natural wear.)  The
probability of ultimately losing data stored in one copy is 1; the
probability of a URL going down is close to 1.  So, RAID works in most
cases; CURIE URLs do not (ultimately) work in most cases.

Disappearing CSS is not a problem for HTML because CSS does not affect the
meaning of the page.

Disappearing scripts are a problem for HTML but they are not a problem for
HTML *data*.  In other words, script-generated content is not guaranteed to
survive, and there is nothing we can do about that except for a warning.
Such content cannot be HTML-validated either.  In general, scripts are best
used (and intended) for behavior, not for creating content.

External SVG files do not describe existing content, they *are* (embedded)
content.  If an HTML file disappears, it becomes unreadable as well, but that
problem obviously cannot be solved from within HTML :-)

"HTML should be readable in 1000 years from now" was an attempt to visualize
the intention of persistence.  It should not be understood as "best before",
of course.

If the author chooses to prevent link rot by creating a redirect to a
well-known vocabulary through a dependent vocabulary stored at his own
site, tools that recognize vocabulary URLs without reading the
corresponding resources will be unable to recognize the author's intent,
and for the tools that do read them, the original vocabulary will still
be unavailable, so this method causes more problems than it solves.

Cheers,
Chris




Re: [whatwg] Link rot is not dangerous

2009-05-15 Thread Shelley Powers

Kristof Zelechovski wrote:

Classes in com.sun.* are reserved for Java implementation details and should
not be used by the general public.  CURIE URLs are intended for general use.

So, I can say "Well, it is not the same", because it is not.

Cheers,
Chris


  
But we're not dealing with Java anymore. We're dealing with using 
reversed DNS concatenated with some kind of default URI, to create some 
kind of bastardized URL, which actually is valid, though incredibly 
painful to see, and can be implied to actually take one to a web address.


You don't have to take my word for it -- check out Philip's testing demo 
for microdata. You get triples with the following:


http://www.w3.org/1999/xhtml/custom#com.damowmow.cat

http://philip.html5.org/demos/microdata/demo.html#output_ntriples
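For context, the triple above comes from markup along these lines (a
sketch; under the proposal, as exercised by the demo, a bare itemprop
name appears to be expanded against the fixed
http://www.w3.org/1999/xhtml/custom# base):

  <p item>
   A <span itemprop="com.damowmow.cat">cat</span>.
  </p>

which yields the predicate
http://www.w3.org/1999/xhtml/custom#com.damowmow.cat.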

Not only do you face problems with link rot, you also face a significant 
amount of confusion, as people look at that and go, "What the hell is 
that?"


Oh, and you can say, "Well, but we don't _mean_ anything by it" -- but 
what does that have to do with anything? People don't go running to the 
spec every time they see something. They look at this thing and think, 
"Oh, a link. I wonder where it goes." You go ahead and try it, and 
imagine for a moment the confusion when it goes absolutely nowhere. 
Except that I imagine the W3C folks are getting a little annoyed with 
the HTML WG now, for allowing this type of thing in, generating a whole 
bunch of 404 errors for the web master(s).


But hey, you've given me another idea. I think I'll create my own 
vocabulary items, with the reversed DNS 
http://www.w3.org/1999/xhtml/custom#com.sun.*. No, maybe 
http://www.w3.org/1999/xhtml/custom#com.opera.*. Nah, how about 
http://www.w3.org/1999/xhtml/custom#com.microsoft.*. Yeah, that's cool. 
And there is no mechanism in place to prevent this, because unlike 
regular URIs, where the domain is actually controlled by a specific 
entity, you've created the world-famous W3C fudge pot. Anything goes.


I can't wait for the lawsuits on this one. If you think that cybersquatting 
is an issue on the web, or Facebook, or Twitter, wait until you see 
people use com.microsoft.*.


Then there's the vocabulary that was created by foobar.com, that people 
think, "Hey, cool, I'll use that... whatever it is." After all, if you 
want to play with the RDF kids, your vocabularies have to be usable by 
other people.


But Foobar takes a dive in the dot com pool, and foobar.com gets taken 
over by a porn establishment. Yeah, I can't wait for people to explain 
that one to the boss. Just because it doesn't link doesn't mean it won't 
end up on Twitter as a big, huge joke.


If you want to find something to criticize, I think it's important to 
realize that hey, folks, you've just stepped over the line, and you're 
now in the Zone of Decentralization. Whatever impacts us, babes, impacts 
all of you. Because if you look at Philip's example, you're going to see 
the same set of vocabulary URIs we're using for RDF right now, as 
microdata uses our stuff, too. Including the links that are all 
trembling on the edge of self-implosion.


So the point of all of this is moot.

But it was fun. Really fun. Have a great weekend.

Shelley


Re: [whatwg] Link rot is not dangerous

2009-05-15 Thread Philip Taylor
On Fri, May 15, 2009 at 6:25 PM, Shelley Powers
shell...@burningbird.net wrote:
 The most important point to take from all of this, though, is that link rot
 within the RDF world is an extremely rare and unlikely occurrence.

That seems to be untrue in practice - see
http://philip.html5.org/data/rdf-namespace-status.txt

The source data is the list of common RDF namespace URIs at
http://ebiquity.umbc.edu/resource/html/id/196/Most-common-RDF-namespaces
from three years ago. Out of those 284:
 * 56 are 404s. (Of those, 37 end with '#', so that URI itself really
ought to exist. In the other cases, it'd be possible that only the
prefix+suffix URIs are meant to exist. Some of the cases are just
typos, but I'm not sure how many.)
 * 2 are Forbidden. (Of those, 1 looks like a typo.)
 * 2 are Bad Gateway.
 * 22 could not connect to the server. (Of those, 2 weren't http://
URIs, and 1 was a typo. The others represent 13 different domains.)

(For the URIs which returned Redirect responses, I didn't check what
happens when you request the URI it redirected to, so there may be
more failures.)

Over a quarter of the most common namespace URIs don't resolve
successfully today, and most of those look like they should have
resolved when they were originally used, so link rot seems to be
common.

(Major vocabularies like RSS and FOAF are likely to exist for a long
time, but they're the easiest cases to handle - we could just
pre-define the prefixes rss: and foaf: and have a centralised
database mapping them onto schemas/documentation/etc. It seems to me
that URIs are most valuable to let any tiny group make one for their
rarely-used vocabulary, and be guaranteed no name collisions without
needing to communicate with a centralised registry to ensure
uniqueness; but it's those cases that are most vulnerable to link rot,
and in practice the links appear to fail quite often.)

(I'm not arguing that link rot is dangerous - just that the numbers
indicate it's a common situation rather than an extremely rare
exception.)

-- 
Philip Taylor
exc...@gmail.com


Re: [whatwg] Link rot is not dangerous

2009-05-15 Thread Tab Atkins Jr.
On Fri, May 15, 2009 at 1:32 PM, Manu Sporny mspo...@digitalbazaar.com wrote:
 Tab Atkins Jr. wrote:
 Reversed domains aren't *meant* to link to anything.  They shouldn't
 be parsed at all.  They're a uniquifier so that multiple vocabularies
 can use the same terms without clashing or ambiguity.  The Microdata
 proposal also allows normal urls, but they are similarly nothing more
 than a uniquifier.

 CURIEs, at least theoretically, *rely* on the prefix lookup.  After
 all, how else can you tell that a given relation is really the same
 as, say, foaf:name?  If the domain isn't available, the data will be
 parsed incorrectly.  That's why link rot is an issue.

 Where in the CURIE spec does it state or imply that if a domain isn't
 available, that the resulting parsed data will be invalid?

Assume a page that uses both foaf and another vocab that subclasses
many foaf properties.  Given working lookups for both, the RDF parser
can determine that two entries with different properties are really
'the same', and hopefully act on that knowledge.

If the second vocab 404s, that information is lost.  The parser will
then treat any use of that second vocab completely separately from the
foaf, losing valuable semantic information.

(Please correct any misunderstandings I may be operating under; I'm
not sure how competent parsers currently are, and thus how much they'd
actually use a working subclassed relation.)
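A sketch of the scenario (the example.org vocabulary is hypothetical;
foaf: is the real FOAF namespace):

  <div xmlns:foaf="http://xmlns.com/foaf/0.1/"
       xmlns:ex="http://example.org/vocab#">
   <span property="foaf:name">Alice</span>
   <span property="ex:penName">A. Liddell</span>
  </div>

If the schema at http://example.org/vocab# declares ex:penName a
subproperty of foaf:name, a reasoning parser can relate the two triples;
if that schema 404s, the second triple stands alone and the linkage is
lost.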

~TJ


Re: [whatwg] Link rot is not dangerous

2009-05-15 Thread Shelley Powers

Philip Taylor wrote:

On Fri, May 15, 2009 at 6:25 PM, Shelley Powers
shell...@burningbird.net wrote:
  

The most important point to take from all of this, though, is that link rot
within the RDF world is an extremely rare and unlikely occurrence.

That seems to be untrue in practice - see
http://philip.html5.org/data/rdf-namespace-status.txt

[...]

  
Philip, I still don't think link rot causes problems in the RDF world 
all that often, but thanks for looking up this data. 
Actually, I will probably quote your info in my next post at my weblog.


I'd like to be dropped from any additional emails in this thread. After 
all, I have it on good authority that I'm not open to rational discussion. 
So I'll leave this type of thing to you guys.


Thanks

Shelley


Re: [whatwg] Link rot is not dangerous

2009-05-15 Thread Kristof Zelechovski
The problem of cybersquatting of oblique domains is, I believe, described
and addressed in the tag URI scheme definition [RFC4151], which I think is
something rather similar to the constructs used for HTML microdata.  I think
that document is relevant not only to this discussion but to the whole
concept.
IMHO,
Chris




Re: [whatwg] Annotating structured data that HTML has no semantics for

2009-05-15 Thread Tab Atkins Jr.
On Wed, May 13, 2009 at 10:04 AM, Leif Halvard Silli l...@malform.no wrote:
 Toby Inkster on Wed May 13 02:19:17 PDT 2009:

 Leif Halvard Silli wrote:

  Hear hear.  Let's call it Cascading RDF Sheets.

 http://buzzword.org.uk/2008/rdf-ease/spec

 http://buzzword.org.uk/2008/rdf-ease/reactions

 I have actually implemented it. It works.

 Oh! Thanks for sharing.

Indeed, RDF-EASE seems fairly nice!

 RDFa is better though.

 What does 'better' mean in this context? Why and how? Because it is easier
 to process? But EASE seems more compatible with microformats, and is
 better in that sense.

I'd also like clarification here.  I dislike *all* of the inline
metadata proposals to some degree, for the same reasons that I dislike
inline @style and @onfoo handlers.  A Selector-based way of applying
semantics fits my theoretical needs much better.

 I read all the reactions you pointed to. Some made the claim that EASE would
 move semantics out of the HTML file, and that microformats were better as they
 keep the semantics inside the file. But I of course agree with you that
 EASE just underlines/outlines the semantics already in the file.

Yup.  The appropriate critique of separated metadata is that the
*data* is moved out of the document, where it will inevitably decay
compared to the live document.  RDF-EASE keeps all the data stored in
the live document, and merely specifies how to extract it.  The only
way you can lose data then is by changing the html structure itself,
which is much less common than just changing the content.

 From the EASE draft:

 All properties in RDF-EASE begin with the string -rdf-, as per §4.1.2.1
 Vendor-specific extensions in [CSS21]. This allows RDF-EASE and CSS to be
 safely mixed in one file, [...]

 I wonder why you think it is so important to be able to mix CSS and EASE. It
 seems better to separate the two completely.

I'm not thrilled with the mixture of CSS and metadata either.  Just
because it uses Selectors doesn't mean it needs to be specifiable
alongside CSS.  jQuery uses Selectors too, but it stays where it
belongs.  ^_^  (That being said, there's a plugin for it that allows
you to specify js in your CSS, and it gets applied to the matching
elements from the block's selector.)

~TJ