On Fri, Jan 9, 2009 at 2:17 PM, Ben Adida <b...@adida.net> wrote:
> Tab Atkins Jr. wrote:
>> Actually, SearchMonkey is an excellent use case, and provides a
>> problem statement.
>
> I'm surprised, but very happily so, that you agree.
>
> My confusion stems from the fact that Ian clearly mentioned SearchMonkey
> in his email a few days ago, then proceeded to say it wasn't a good use
> case.

I apologize; looking back into my archives, it appears there was an
entire subthread specifically about SearchMonkey! Also, Ian did indeed
mention it in his first email in this thread. In fact, he gave it more
attention than any other single use case. I'll quote the relevant part:

> On Tue, 26 Aug 2008, Ben Adida wrote:
> >
> > Here's one example. This is not the only way that RDFa can be helpful,
> > but it should help make things more concrete:
> >
> > http://developer.yahoo.com/searchmonkey/
> >
> > Using semantic markup in HTML (microformats and, soon, RDFa), you, as a
> > publisher, can choose to surface more relevant information straight into
> > Yahoo search results.
>
> This doesn't seem to require RDFa or any generic data syntax at all. Since
> the system is site-specific anyway (you have to list the URLs you wish to
> act against), the same kind of mechanism could be done by just extracting
> the data straight out of the page. This would have the advantage of
> working with any Web page without requiring the page to be written using a
> particular syntax.
>
> However, if SearchMonkey is an example of a use case, then we should
> determine the requirements for this feature. It seems, based on reading
> the documentation, that it basically boils down to:
>
> * Pages should be able to expose nested lists of name-value pairs on a
>   page-by-page basis.
>
> * It should be possible to define globally-unique names, but the syntax
>   should be optimised for a set of predefined vocabularies.
>
> * Adding this data to a page should be easy.
>
> * The syntax for adding this data should encourage the data to remain
>   accurate when the page is changed.
>
> * The syntax should be resilient to intentional copy-and-paste authoring:
>   people copying data into the page from a page that already has data
>   should not have to know about any declarations far from the data.
>
> * The syntax should be resilient to unintentional copy-and-paste
>   authoring: people copying markup from the page who do not know about
>   these features should not inadvertently mark up their page with
>   inapplicable data.
>
> Are there any other requirements that we can derive from SearchMonkey?

I agree with Ian in that SearchMonkey is not *necessarily* speaking in
favor of RDFa; that may be what caused you to think he was dismissing
it. In truth, Ian is merely trying to take current examples of RDFa use
and distill them into their essence. (To grab my previous example, it
is similar to seeing what all the various rounded-corners hacks were
doing, without necessarily implying that the final solution will be
anything like them. It's important to distill the actual problems that
users are solving from the details of particular solutions they are
using.)

Like I said, I think SearchMonkey sounds absolutely awesome, and
genuinely useful on a level I haven't yet seen any apps of similar
nature reach. I'm exclusively a Google user, but that's something I'd
love to have ported over. It's similar in nature to IE8's Accelerators,
in that it's an opt-in application for users that reduces clicks to get
to information they actively decide they want.

However, Ian has a point in his first paragraph. SearchMonkey does
*not* do auto-discovery; it relies entirely on site owners telling it
precisely what data to extract, where it's allowed to extract it from,
and how to present it. It is likely that this can be done entirely
within the confines of current HTML, and the fact that SearchMonkey can
use Microformats suggests that this is true. A possible approach is a
site owner producing an ad-hoc microformat (little m) that the crawler
can match against pages, indexing the matched information and then
offering it to the SearchMonkey application for presentation as the
developer wills.
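
As a rough sketch of that ad-hoc approach (not how SearchMonkey itself
works; the class names, the `template` mapping, and the `extract` helper
are all made up for illustration), a crawler could pull fields out of
ordinary class-annotated HTML along these lines:

```python
from html.parser import HTMLParser

# Elements that never get a closing tag, so they must not touch the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class AdHocExtractor(HTMLParser):
    """Collects text from elements whose class appears in a site-supplied template."""

    def __init__(self, template):
        super().__init__()
        self.template = template  # hypothetical mapping: class name -> field name
        self.fields = {}          # field name -> accumulated text
        self._stack = []          # one entry per open element: field name or None

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        # First class that the template knows about decides the field.
        field = next((self.template[c] for c in classes if c in self.template), None)
        self._stack.append(field)

    def handle_endtag(self, tag):
        # Assumes well-nested markup; a real crawler would need recovery rules.
        if tag not in VOID_TAGS and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Credit the text to the nearest enclosing element that matched.
        for field in reversed(self._stack):
            if field is not None:
                self.fields[field] = self.fields.get(field, "") + data
                break


def extract(html, template):
    """Return {field name: whitespace-normalised text} for one page."""
    parser = AdHocExtractor(template)
    parser.feed(html)
    parser.close()
    return {name: " ".join(text.split()) for name, text in parser.fields.items()}


page = ('<div class="listing"><span class="biz-name">Cafe Uno</span> '
        '<span class="biz-phone">555-0100</span></div>')
template = {"biz-name": "name", "biz-phone": "phone"}
print(extract(page, template))  # {'name': 'Cafe Uno', 'phone': '555-0100'}
```

The matching rules here (first matching class wins, nearest matched
ancestor gets the text) are arbitrary choices of the sketch.
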
This would require specified parsing rules for such things (which, as
mentioned in an earlier email, the big-m Microformats community is
working on).

The question is, would this be sufficient? Are other approaches easier
for authors? RDFa, as noted, already has a specified parsing model.
Does this make it easier for authors to design data templates? Easier
to communicate templates to a crawler? Easier to deploy in a site?
Easier to parse for a crawler?

SearchMonkey makes mention of developers producing SearchMonkey apps
without the explicit permission of site owners. This use would almost
certainly be better served with a looser data discovery model than
RDFa, so that a site owner doesn't have to explicitly comply in order
for others to extract useful data from their pages. How important is
this?

These are precisely the sort of questions I think Ian wants and needs
asked. SearchMonkey is an awesome app; do we need to do anything to
support it and similar apps? *Can* anything we do support it, or is it
best served by solutions that ignore us completely? Yes, SearchMonkey
operates on metadata, and the problem space doesn't allow
natural-language processing to stand in for it; it is not clear,
though, that a strict markup approach is best for authors or users.
Nevertheless, it is an excellent use case to distill requirements from
so we *can* determine if a spec-based solution is desirable.

~TJ