On Fri, Jan 9, 2009 at 3:22 PM, Ben Adida <b...@adida.net> wrote:
> Tab Atkins Jr. wrote:
>> However, Ian has a point in his first paragraph. SearchMonkey does
>> *not* do auto-discovery; it relies entirely on site owners telling it
>> precisely what data to extract, where it's allowed to extract it from,
>> and how to present it.
>
> That's incorrect.
>
> You can build a SearchMonkey infobar that is set to function on all URLs
> (just use "*" in your URL field.)
>
> For example, the Creative Commons SearchMonkey application:
>
> http://gallery.search.yahoo.com/application?smid=kVf.s
>
> (currently broken because of a recent change in the SearchMonkey PHP API
> that we need to address, so here's a photo:
>
> http://www.flickr.com/photos/ysearchblog/2869419185/
> )
>
> By adding the CC RDFa markup to your page, it will show up with the
> infobar in Yahoo searches.

Ah, hadn't considered a net-wide SearchMonkey script. Interesting.
This brings up different issues, however. Something I see immediately:

Say I'm a scammer. I know that the CC SearchMonkey app is in wide use
(pretend, here). I start putting CC-RDF data in spam blog comments,
with my own spammy stuff in the relevant fields. Now people don't even
have to click on the blog link in the search results and read my
obviously spammy comment to be introduced to my offers for discount
Viagra! They'll just see a little CC bar, click on it to have it open
in-place, and there I am. I could even hide my link in legitimate
license data, so that people only hit my malicious site when they
click the link to see more information about the license.

Issues like these make wide-scale auto-trusted use of metadata
difficult. It also makes me more reluctant to want it in the spec yet.
I'd rather see the community work out these problems first.

It may be that there's a relatively simple solution. It may be that
the crawlers can reliably distinguish between ham and spam CC data.
But then, it may be that there *is* no good solution enabling us to
use this approach, and this kind of metadata on arbitrary sites just
can't be trusted. I, personally, don't know the answer to this yet. I
suspect that you don't, either; if the arbitrary-site CC infobar works
at all, it's because few people *use* CC RDF yet, and so it's still
limited to a community with implicit trust.

> So site-specific microformats are clearly less powerful. And
> vocabulary-specific microformats, while useful, are also not as useful
> here (consider a SearchMonkey application that picks up CC-licensed
> items, be they video, audio, books, scientific data, etc... Different
> microformats = development hell.)

Indeed, they are less powerful. As I explored above, though, too much
power can be damning.
It may be that the site-specific little-m microformat (or something
equivalent, allowing a developer to extract metadata through actively
targeting site structure) is powerful enough to be useful, but weak
enough to *remain* useful in the face of abuse.

(Also, I know CC is sort of the darling of the RDFa community, but
there's significant enough debate over in-band vs out-of-band
licensing info, etc. that detracts from the core issues we're trying
to discuss here that it's probably not the best example to use.)

> Have you read the RDFa Primer?
> http://www.w3.org/TR/xhtml-rdfa-primer/
>
> It describes (pre-SearchMonkey) the kind of applications that can be
> built with RDFa. SearchMonkey is an ideal example, but it's by no means
> the only one.

Yup; I was an active participant in this discussion when it started
last August. The example applications discussed in the paper,
unfortunately, are precisely the kind where trusting metadata is
likely a *bad* idea. For example, finding reviews of shows produced by
friends of Alice, using foaf and hreview, is rife with opportunity for
spamming.

SearchMonkey seems to avoid this for the most part; when designing
applications for particular URLs, at least, you are relying on
relatively trustworthy data, not arbitrary data scattered across the
web. Perhaps something similar has application within trusted
networks, but in that case it comprises a completely different use
case than what SearchMonkey hits, with possibly different
requirements.

~TJ