Re: Making human-friendly linked data pages more human-friendly
I think there are a few scenarios here.

In my mind, dbpedia.org is a site for tripleheads. I use it all the time when I'm trying to understand how my systems interact with data from dbpedia -- for that purpose, it's useful to see a reasonably formatted list of triples associated with an item. A view that's isomorphic to the triples is useful for me there. Yes, better interfaces for browsing dbpedia/wikipedia ought to be built -- navigation along axes of type, time, and space would be obviously interesting -- but making a usable interface for this involves some challenges which are outside the scope of dbpedia.org; the point of linked data is that anybody who wants to can build a better browsing interface for dbpedia.

Another scenario is a site that's ~primarily~ a site for humans and secondarily a site for tripleheads and machines, for instance http://carpictures.cc/ . That particular site is built on an object-relational system which has some (internal) RDF features. The site was created by merging dbpedia, freebase and other information sources, so it exports linked data that links dbpedia concepts to images with very high precision. The primary vocabulary is SIOC, and the RDF content for a page is ~nearly~ isomorphic to the content of the main part of the page (excluding the sidebar.) However, there is content that's currently exclusive to the human interface. For instance, the UI is highly visual: for every automobile make and model, there are heuristics that try to pick an image that's both striking and representative of the brand. This selection is materialized in the database. There's information designed to give humans an information scent to help them navigate, a concept which isn't so well-defined for webcrawlers. Then there's the sidebar, which has several purposes, one of them being a navigational system for humans, that just isn't so relevant for machines.

There really are two scenarios I see for linked data users relative to this system at the moment: (i) a webcrawler crawls the whole site, or (ii) I provide a service that, given a linked data URL, returns information about what ontology2 knows about the URL. For instance, this could be used by a system that's looking for multimedia connected with anything in dbpedia or freebase. Perhaps I should be offering an NT dump of the whole site, but I've got no interest in offering a SPARQL endpoint.

As for friendly interfaces, I'd say take an analytical look at a page like

http://carpictures.cc/cars/photo/car_make/21/Chevrolet

What's going on here? This is being done on a SQL-derivative system that has a query builder, but you could do the same thing with SPARQL. We'd imagine that there are some predicates like hasCarModel, hasPhotograph, and hasPreferredThumb. Starting with a URL that represents a make of car (a nameplate, like Chevrolet), we'd traverse the hasCarModel relationship to enumerate the models, and then do a COUNT(*) of hasPhotograph relationships to create a count of pictures for each model (a sketch of such a query is below). Generically, the construction of a page like this involves doing joins and traversing the graph to show not just the triples that are linked to a named entity, but information that can be found by traversing a graph. People shouldn't be shy about introducing their own predicates; the very nature of inference in RDF points to creating a new predicate as the basic solution to most problems. In this case, hasPreferredThumb is a perfectly good way to materialize the result of a complex heuristic.
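To make that concrete, here's a minimal sketch of the query in SPARQL (assuming SPARQL 1.1 aggregates; the cars: prefix, the make URI, and the exact predicate names are hypothetical stand-ins invented for illustration):

    PREFIX cars: <http://example.org/vocab/cars#>

    # Start from a make, enumerate its models, and count the
    # photographs attached to each model.
    SELECT ?model (SAMPLE(?thumb) AS ?preferredThumb) (COUNT(?photo) AS ?photoCount)
    WHERE {
      <http://example.org/make/Chevrolet> cars:hasCarModel ?model .
      ?model cars:hasPhotograph ?photo .
      OPTIONAL { ?model cars:hasPreferredThumb ?thumb }
    }
    GROUP BY ?model
    ORDER BY DESC(?photoCount)

Each result row is one line of the page: a model, its preferred thumbnail, and its picture count.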
(One reason I'm sour about public SPARQL endpoints is that I don't want to damage my brand by encouraging amnesic mashups of my content; a quality site really needs a copy of its own data so it can make additions, corrections, etc. One major shortcoming of Web 2.0 has been self-serving API TOS that forbid systems from keeping a memory -- for instance, eBay doesn't let you make a price tracker or a system that keeps dossiers on sellers, and Del.icio.us makes it easy to put data in, but you can't get anything interesting out. Web 3.0 has to make a clean break from this.)

Database-backed sites traditionally do this with a mixture of declarative SQL code and procedural code to create a view... It would be interesting to see RDF systems where the graph traversal is specified and transformed into a website declaratively.
Re: Making human-friendly linked data pages more human-friendly
On Thu, Sep 17, 2009 at 7:23 AM, Kingsley Idehen <kide...@openlinksw.com> wrote:

> This is basically an aspect of the whole Linked Data meme that is lost on too many.

I've got to thank the book by Allemang and Hendler (http://www.amazon.com/Semantic-Web-Working-Ontologist-Effective/dp/0123735564) for setting me straight about data modeling in RDF. RDFS and OWL are based on a system of duck typing that turns conventional object or object-relational thinking inside out. It's not necessarily good or bad, but it's really different. Even though types matter, predicates come before types, because using predicate A can make object B become a member of type C, even if B is never explicitly put in class C. Looking at the predicates in RDFS or OWL without understanding the whole, it's pretty easy to think "oh, this isn't too different from a relational database" and miss the point that RDFS/OWL is much more about inference (creating new triples) than it is about constraints or the physical layout of the data.

One consequence of this is that using an existing predicate can drag in a lot more baggage than you might want; it's pretty easy to get the inference engine to infer too much, and false inferences can snowball like a katamari. A lot of people are in the habit of reusing vocabularies and seem to forget that the natural answer to most RDF modeling problems is to create a new predicate. OWL has a rich set of mechanisms that can tell systems that x A y implies x B y, where A is your new predicate and B is a well-known predicate (see the first sketch at the end of this message). Once you merge two almost-but-not-the-same things by actually using the same predicate, it's very hard to fix the damage. If you use inference, it's easy to change your mind.

--

It may be different with other data sets, but data cleaning is absolutely essential when working with dbpedia if you want to make production-quality systems. For instance, all of the time people build bizapps and they need a list of US states... Usually we go and cut and paste one from somewhere... But now I've got dbpedia and I should be able to do this systematically. There's a category in wikipedia for that:

http://en.wikipedia.org/wiki/Category:States_of_the_United_States

If you ignore the subcategories and just take the actual pages, it's (almost) what you need, except for some weirdos like User:Beebarose/Alabama (http://en.wikipedia.org/wiki/User:Beebarose/Alabama) and one state that's got a disambiguator in the name: Georgia (U.S. state) (http://en.wikipedia.org/wiki/Georgia_%28U.S._state%29). It's not hard to clean up this list (see the second sketch below), but it takes some effort, and ultimately you're probably going to materialize something new.

These sorts of issues turn up even in highly clean data sets. Once I built a webapp that had a list of countries in it; the list was used to draw a dropdown, but the dropdown was excessively wide, busting the layout of the site. The list was really long because there were a few authoritarian countries with long and flowery names. The transformation from "Democratic People's Republic of Korea" to "North Korea" improved the usability of the site while eliminating Orwellian language. This kind of fit and finish is needed to make quality sites, and semweb systems are going to need automated and manual ways of fixing this so that Web 3.0 looks like a step forward, not a step back.
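First, a minimal sketch of the "x A y implies x B y" point, written as a SPARQL CONSTRUCT so the inference is visible as triple creation (the my: prefix and my:hasPreferredThumb are hypothetical; foaf:depiction stands in for the well-known predicate B):

    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX my:   <http://example.org/myvocab#>

    # Declaring "my:hasPreferredThumb rdfs:subPropertyOf foaf:depiction"
    # licenses exactly this rule: every triple that uses the new
    # predicate entails a triple that uses the well-known one.
    CONSTRUCT { ?car foaf:depiction ?img }
    WHERE     { ?car my:hasPreferredThumb ?img }

If the mapping turns out to be wrong, you retract one subPropertyOf triple and the entailments go away; if you had used foaf:depiction directly in your data, you'd be rewriting the whole data set.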
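Second, a first pass at the states query (skos:subject is what the dbpedia dumps of this era use to link a resource to its wikipedia category; the FILTER is a crude cleanup step, not a complete solution):

    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

    # Pages directly in the category, minus the User: weirdos.
    # Georgia still comes back as Georgia_(U.S._state), so some
    # manual cleanup -- and probably materialization -- remains.
    SELECT ?state
    WHERE {
      ?state skos:subject
        <http://dbpedia.org/resource/Category:States_of_the_United_States> .
      FILTER (!regex(str(?state), "User:"))
    }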
Re: Making human-friendly linked data pages more human-friendly
On Thu, Sep 17, 2009 at 12:19 PM, Kingsley Idehen <kide...@openlinksw.com> wrote:

> Schema Last vs. Schema First :-) An RDF virtue that, once broadly understood across the more traditional DBMS realms, will work wonders for RDF based Linked Data appreciation.

That's the conclusion that I'm coming to. I've been thinking about the question of what Cyc would look like if it were started today. Cyc took the Schema First approach to the human memome project: as a result it put a lot of work into upper and middle ontologies which don't seem all that useful to many observers. Despite a great deal of effort put into avoiding 'representational thorns', it got caught up in them anyway.

A modern approach would be to start with a huge amount of data over various domains and to construct schemas using a mix of statistical inference and human input. The role of the upper ontology would be reduced here because, in general, it isn't always necessary to mesh up two randomly chosen domains, say: bus schedules, anime, psychoanalysis, particle physics. Now, somebody might want to apply the system to study the relationship of anime with psychoanalysis; that could be approached by constructing a metatheory (i) based on those particular domains, and (ii) conditioned by the application that the system is being put to -- that is, "on the bit", connected via a feedback loop to some means of evaluating the system's motion towards a goal. Representational thorns get bypassed here because the system is free to develop a new representation if an old one fails for a particular task.