Re: [whatwg] Trying to work out the problems solved by RDFa

Calogero Alex Baldacchino Sat, 03 Jan 2009 11:22:38 -0800

Dan Brickley wrote:

On 3/1/09 14:02, Julian Reschke wrote:
Tab Atkins Jr. wrote:
The most successful alternative is nothing at all. ^_^ We can
extract copious data from web pages reliably without metadata, either
using our human senses (in personal use) or natural-language-based
processing (in search engine use). It has not yet been established
that sufficient and significant enough problems *exist* to justify a
solution, let alone one that requires an addition to html. That is
what Ian is specifically looking for.
That's what you and Ian claim. Many disagree.
My main problem with the natural language processing option is that it feels too close to waiting for Artificial Intelligence. I'd rather add 6 attributes to HTML and get on with life.
But perhaps a more practical concern is that it unfairly biases things towards popular languages - lucky English, lucky Spanish, etc., and those that lend themselves more to NLP analysis. *The Web is for everyone*, and people shouldn't be forced to read and write English to enjoy the latest advances in *Web automation*. Since HTML5 is going through W3C, such considerations need to be taken pretty seriously.

My concern is: is RDFa really suitable for everyone and for Web automation? My own answer, at first glance, is no. RDF(a) can perhaps address very niche needs nicely, where determining how much the data can be trusted is not a problem. In general, though, misuse AND deliberate abuse may harm automation heavily, since an automaton is unlikely to be able to tell whether metadata express the real meaning of a web page (not without a certain degree of AI).

If an external mechanism is needed to determine the trust level of metadata, that is, to establish whether the results of an automation are good or bad, such a mechanism may involve human beings at some stage, thus breaking automation (this is somewhat similar to the problem of defining an "oracle machine" described by Turing, according to whom such a machine isn't an automaton).

On the other hand, a very custom model designed for very custom needs (and not requiring wide support) may be less prone to abuse, since nobody is likely to want to cheat himself. Thus, having third parties agree on a certain model and related APIs, and implement those APIs on their own sides, might be more reliable in some cases (though the third parties would still have to agree that their respective metadata are reliable, and find a way to verify that they really are).


Dan Brickley wrote:

On 3/1/09 16:54, Håkon Wium Lie wrote:
Also sprach Dan Brickley:
> My main problem with the natural language processing option is that it
> feels too close to waiting for Artificial Intelligence. I'd rather add 6
> attributes to HTML and get on with life.

:-)
Another thought re NLP. RDFa (and similar, ...) are formats that can be used for writing down the conclusions of NLP analysis. For example here see the BBC's recent Muddy Boots experiment, using DBPedia (Wikipedia in RDF) data to drive autoclassification / named entity recognition. So here we can agree with Ian and others that text analysis has much to offer, and still use RDFa (or other semantic markup - i'll sidestep that debate for now) as a notation for marking up the words with a machine-friendly indicator of their NLP-guessed meaning.
http://www.bbc.co.uk/blogs/journalismlabs/2008/12/muddy_boots.html
Personally, I think the 'class' attribute may still be a more
compelling option in a less-is-more way. It already exists and can
easily be used for styling purposes. Styling is bait for authors to
disclose semantics.
I'm sure there's mileage to be had there. I'm somehow incapable of writing XSLT so GRDDL hasn't really charmed me, but 'class' certainly corresponds to a lot of meaningful markup. Naturally enough it is stronger at tagging bits of information with a category than at defining relationships amongst the things defined when they're scattered around the page. But that's no reason to dismiss it entirely.
Did you see the RDF-EASE draft, http://buzzword.org.uk/2008/rdf-ease/spec ? From which comes: "Ten second sales pitch: CSS is an external file that specifies how your document should look; *RDF-EASE is an external file that specifies what your document means.*"
RDF-EASE uses CSS-based syntax. More discussion here, http://lists.w3.org/Archives/Public/semantic-web/2008Dec/0148.html including question of whether it ought to be expressed using css3-namespace, http://lists.w3.org/Archives/Public/semantic-web/2008Dec/0175.html
cheers,

Dan

--
http://danbri.org/

My question is: how often can I trust that such a file specifies what your document really means, without evaluating its content?


I'd distinguish two cases (not pretending to make a complete classification):

- The semantics described by the metadata are used for server-side computations: there's no need to evaluate the content (since I'm trusting you when I navigate your site, and you are unlikely to mess with yourself on purpose), nor to have client-side support for such metadata (in the UA). This is the case of a centralised database.

For instance, a *pedia page may send queries to the server, which processes them and sends the results back to the user.

- The UA must understand metadata and automatically gather information mashed up in a page from several sources: each source must be actively evaluated and trusted (a bot can't do that). This is the case of a decentralised database.

For instance, it's easy to think of a spamming advertiser who puts apparently honest content into your pages (which may take reliable content from dbpedia), while using fake metadata to cheat my browser and send me irrelevant information (or information I'm not interested in) when I ask for related content [1], perhaps without you even guessing what's going on (and you may be losing visitors because of that).

For obvious reasons, a trust evaluation mechanism can't be as easy as getting/creating a signature to be used in a secure connection, because someone must actively evaluate at least two things:

- that the metadata really reflect the content of a resource, and

- that the metadata are properly used with respect to the external schema involved in modelling the data (otherwise, no relationship would be reliable; however, this might be a minor concern from a certain angle, since misused metadata might be less harmful than deliberately abused ones).
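To make the two checks concrete, here is a rough sketch in Python of what a naive automated version might look like. Everything in it is an assumption made for illustration: the KNOWN_PROPERTIES set stands in for an external schema, prefixes such as "dc:" are compared as plain strings without any namespace resolution, and "reflects the content" is reduced to "the claimed value appears in the visible text". That last reduction is exactly the weak point: an abuser only needs to repeat the claimed words somewhere on the page to pass it.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for an external schema (an assumption of this sketch).
KNOWN_PROPERTIES = {"dc:title", "dc:creator", "dc:subject"}


class PageScanner(HTMLParser):
    """Collects property/content attribute pairs and the visible text."""

    def __init__(self):
        super().__init__()
        self.claims = []      # (property, content) pairs found on elements
        self.text_parts = []  # character data between tags

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # Skip elements where either attribute is missing or has no value.
        if attr_map.get("property") and attr_map.get("content"):
            self.claims.append((attr_map["property"], attr_map["content"]))

    def handle_data(self, data):
        self.text_parts.append(data)


def check_page(html):
    """Return (property, content, problem) for every suspicious claim."""
    scanner = PageScanner()
    scanner.feed(html)
    visible = " ".join(scanner.text_parts).lower()
    problems = []
    for prop, content in scanner.claims:
        if prop not in KNOWN_PROPERTIES:
            # Second check: the property is not part of the agreed schema.
            problems.append((prop, content, "unknown property"))
        elif content.lower() not in visible:
            # First check (naive): the claim is not backed by the visible text.
            problems.append((prop, content, "not supported by visible text"))
    return problems


page = """
<div>
  <span property="dc:subject" content="Gardening">Tips on gardening in spring.</span>
  <span property="dc:subject" content="Cheap loans">More about tulips.</span>
  <span property="ex:rating" content="5">Great.</span>
</div>
"""

for problem in check_page(page):
    print(problem)
```

The sketch flags the "Cheap loans" claim and the unknown "ex:rating" property, but it cannot tell an honest paraphrase from a lie, which is why I think the real evaluation ends up needing a human (or some AI).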

The result can be very expensive (like certifying a driver or an application for a certain platform), or lead to a free choice to skip any evaluation and simply trust any third party. Both solutions may work, perhaps, for niche/limited cases, but I don't think either can be a good basis for "global", general-purpose automation.

[1] That's not the same as using the @rel attribute without any relationship to other metadata: a UA may just present a link described as pointing to a resource related to the surrounding content, so that I can choose whether to follow it or not. If instead the @rel attribute is used by an automated mechanism in response to a query and in combination with other metadata, the UA must decide on its own whether a link is worth following, and I don't think there is any easy way to take automated decisions involving trust.
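As an illustration of the first, harmless case in [1], the following Python sketch only collects links that carry a @rel value and hands them back for display, leaving the choice to follow them with the user. The class and function names, and the example.org URLs, are made up for this sketch. It deliberately contains no step that decides whether a link deserves to be followed, since that is the part I don't see how to automate.

```python
from html.parser import HTMLParser


class RelLinkCollector(HTMLParser):
    """Collects (rel token, href) pairs from <a> and <link> elements."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr_map = dict(attrs)
        href = attr_map.get("href")
        rel = attr_map.get("rel")
        if href and rel:
            # @rel holds a space-separated list of link types.
            for token in rel.split():
                self.links.append((token.lower(), href))


def links_for_user(html):
    """Return the described links so the UA can show them; nothing is fetched."""
    collector = RelLinkCollector()
    collector.feed(html)
    return collector.links


sample = """
<p>See the <a rel="license" href="http://example.org/license">license</a>
and the <a rel="next prefetch" href="http://example.org/page2">next page</a>.
<a href="http://example.org/plain">No rel here</a></p>
"""

for rel, href in links_for_user(sample):
    print(rel, href)
```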


Best regards,
Alex

