Re: Making human-friendly linked data pages more human-friendly

2009-09-17 Thread Paul A Houle
   I think there are a few scenarios here.

   In my mind,  dbpedia.org is a site for tripleheads.  I use it all the
time when I'm trying to understand how my systems interact with data from
dbpedia -- for that purpose,  it's useful to see a reasonably formatted list
of triples associated with an item.  A view that's isomorphic to the triples
is useful for me there.

   Yes, better interfaces for browsing dbpedia/wikipedia ought to be built
-- navigation along axes of type, time, and space would be obviously
interesting -- but making a usable interface for this involves challenges
that are outside the scope of dbpedia.org. The point of linked data is that
anybody who wants a better browsing interface for dbpedia can go build one.

   Another scenario is a site that's ~primarily~ a site for humans and only
secondarily a site for tripleheads and machines,  for instance,

http://carpictures.cc/

   That particular site is built on an object-relational system which has
some (internal) RDF features.  The site was created by merging dbpedia,
freebase and other information sources,  so it exports linked data that
links dbpedia concepts to images with very high precision.  The primary
vocabulary is SIOC,  and the RDF content for a page is ~nearly~ isomorphic
to the content of the main part of the page (excluding the sidebar).

   However, there is content that's currently exclusive to the human
interface. For instance, the UI is highly visual: for every automobile
make and model, there are heuristics that try to pick an image that is
better than average at being both striking and representative of the brand.
This selection is materialized in the database. There's also information
designed to give humans an information scent to help them navigate, a
concept which isn't so well-defined for webcrawlers. Then there's the
sidebar, which has several purposes, one of them being a navigational
system for humans; that purpose just isn't so relevant for machines.

   There really are two scenarios I see for linked data users relative to
this system at the moment:  (i) a webcrawler crawls the whole site,  or (ii)
I provide a service that, given a linked data URL, returns what ontology2
knows about that URL.  For instance,  this could be used
by a system that's looking for multimedia connected with anything in dbpedia
or freebase.  Perhaps I should be offering an NT dump of the whole site,
but I've got no interest in offering a SPARQL endpoint.

   As for friendly interfaces, I'd say take an analytical look at a page
like

http://carpictures.cc/cars/photo/car_make/21/Chevrolet

   What's going on here?  This is being done on a SQL-derivative system that
has a query builder, but you could do the same thing with SPARQL. We'd imagine
that there are some predicates like

hasCarModel
hasPhotograph
hasPreferredThumb

   starting with a URL that represents a make of car (a nameplate, like
Chevrolet), we'd traverse the hasCarModel relationship to enumerate the
models, and then do a COUNT(*) of hasPhotograph relationships for the cars
to get a count of pictures for each model. Generically, the construction
of a page like this involves doing joins and traversing the graph to show
not just the triples directly attached to a named entity, but information
that can only be found by walking further out into the graph.
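
   To make that concrete, here's a rough SPARQL sketch of that traversal.
The hasCarModel / hasPhotograph predicates are the hypothetical ones listed
above, and the cp: prefix and the Chevrolet URI are made up for illustration;
COUNT and GROUP BY here are SPARQL 1.1-style aggregates.

PREFIX cp: <http://example.org/carpictures/>

# Start from the make, enumerate its models, and count photographs per model.
SELECT ?model (COUNT(?photo) AS ?photoCount)
WHERE {
  cp:Chevrolet cp:hasCarModel   ?model .   # make -> model
  ?model       cp:hasPhotograph ?photo .   # model -> photographs
}
GROUP BY ?model
ORDER BY DESC(?photoCount)
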
People shouldn't be shy about introducing their own predicates;  the very
nature of inference in RDF points to creating a new predicate as the basic
solution to most problems.  In this case,  hasPreferredThumb is a perfectly
good way to materialize the result of a complex heuristic.
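
Once the heuristic has picked a winner, materializing the choice is just one
more triple; a minimal sketch as a SPARQL Update, with made-up resource
names:

PREFIX cp: <http://example.org/carpictures/>

# Record the image the heuristic selected as the preferred thumbnail.
INSERT DATA {
  cp:Chevrolet cp:hasPreferredThumb cp:photo4521 .
}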

(One reason I'm sour about public SPARQL endpoints is that I don't want to
damage my brand by encouraging amnesic mashups of my content;  a quality
site really needs a copy of its own data so it can make additions,
corrections, etc.; one major shortcoming of Web 2.0 has been self-serving
API TOS that forbid systems from keeping a memory -- for instance, eBay
doesn't let you make a price tracker or a system that keeps dossiers on
sellers.  Del.icio.us makes it easy to put data in,  but you can't get
anything interesting out.  Web 3.0 has to make a clean break from this.)

Database-backed sites traditionally do this with a mixture of declarative
SQL code and procedural code to create a view...  It would be interesting to
see RDF systems where the graph traversal is specified and transformed into
a website declaratively.


Re: Making human-friendly linked data pages more human-friendly

2009-09-17 Thread Paul A Houle
On Thu, Sep 17, 2009 at 7:23 AM, Kingsley Idehen kide...@openlinksw.com wrote:



 This is basically an aspect of the whole Linked Data meme that is lost on
 too many.


I've got to thank the book by Allemang and Hendler

http://www.amazon.com/Semantic-Web-Working-Ontologist-Effective/dp/0123735564

for setting me straight about data modeling in RDF.  RDFS and OWL are based
on a system of duck typing that turns conventional object or
object-relational thinking inside out.  It's not necessarily good or bad,
but it's really different.  Even though types matter,  predicates come
before types because using predicate A can make object B become a member of
type C, even if B is never explicitly declared to be in class C.
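
To make that inside-out behavior concrete: if a schema says that some
predicate has an rdfs:domain, then merely using that predicate on a subject
is enough for an RDFS reasoner to put the subject in the domain class. The
entailment rule itself can be written as a SPARQL CONSTRUCT (a sketch, not
tied to any particular vocabulary; the variable names echo the A/B/C above):

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# The rdfs:domain entailment rule, phrased as a query: any subject that
# appears with a predicate acquires that predicate's declared domain class.
CONSTRUCT { ?b a ?c }
WHERE {
  ?b ?a ?o .
  ?a rdfs:domain ?c .
}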

Looking at the predicates in RDFS or OWL without understanding the whole,
it's pretty easy to think "oh, this isn't too different from a relational
database" and miss the point that RDFS/OWL is much more about inference
(creating new triples) than it is about constraints or the physical layout
of the data.

One consequence of this is that using an existing predicate can drag in a
lot more baggage than you might want;  it's pretty easy to get the inference
engine to infer too much,  and false inferences can snowball like a
katamari.

A lot of people are in the habit of reusing vocabularies and seem to forget
that the natural answer to most RDF modeling problems is to create a new
predicate.  OWL has a rich set of mechanisms that can tell systems that

x A y -> x B y

where A is your new predicate and B is a well-known predicate.  Once you
merge two almost-but-not-the-same things by actually using the same
predicate,  it's very hard to fix the damage.  If you use inference,  it's
easy to change your mind.
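
Concretely, the mechanism is rdfs:subPropertyOf. Suppose the schema contains
the single triple

:hasPreferredThumb rdfs:subPropertyOf foaf:depiction .

(pairing the new predicate with foaf:depiction is just an illustration). The
"x A y -> x B y" entailment then falls out of the standard rule, which can be
phrased as a SPARQL CONSTRUCT:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# The rdfs:subPropertyOf entailment rule: every statement made with the
# narrower predicate is restated with the broader one.
CONSTRUCT { ?x ?b ?y }
WHERE {
  ?x ?a ?y .
  ?a rdfs:subPropertyOf ?b .
}

Drop or change that one subPropertyOf triple and the derived statements go
away with it -- which is exactly the "easy to change your mind" property.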

--

It may be different with other data sets, but data cleaning is absolutely
essential when working with dbpedia if you want to make production-quality
systems.

For instance,  all of the time people build bizapps and they need a list of
US states...  Usually we go and cut and paste one from somewhere...  But now
I've got dbpedia and I should be able to do this systematically.  There's a
category in wikipedia for that...

http://en.wikipedia.org/wiki/Category:States_of_the_United_States

if you ignore the subcategories and just take the actual pages,  it's
(almost) what you need,  except for some weirdos like

User:Beebarose/Alabama -- http://en.wikipedia.org/wiki/User:Beebarose/Alabama

and one state that's got a disambiguator in the name:

Georgia (U.S. state) -- http://en.wikipedia.org/wiki/Georgia_%28U.S._state%29

It's not hard to clean up this list,  but it takes some effort,  and
ultimately you're probably going to materialize something new.
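
For what it's worth, the first cut at that list is one query against the
dbpedia endpoint; a rough sketch, assuming the category-membership predicate
is skos:subject (as in the dbpedia releases of that era -- later ones use
dcterms:subject), with a crude filter for the user-space weirdos:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

# Pull everything filed directly under the category, ignoring subcategories,
# and throw out user-space pages like User:Beebarose/Alabama.
SELECT DISTINCT ?state
WHERE {
  ?state skos:subject
    <http://dbpedia.org/resource/Category:States_of_the_United_States> .
  FILTER (!regex(str(?state), "User:"))
}
ORDER BY ?state

That still leaves Georgia with its "(U.S. state)" disambiguator, so you end
up materializing a cleaned-up label anyway.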

These sorts of issues turn up even in highly clean data sets. Once I built
a webapp that had a list of countries in it; the list was used to draw a
dropdown, but the dropdown came out excessively wide, busting the layout of
the site. The dropdown was wide because a few authoritarian countries have
long and flowery official names. The transformation from

Democratic People's Republic of Korea -> North Korea

improved the usability of the site while eliminating Orwellian language.
This kind of fit and finish is needed to make quality sites,  and semweb
systems are going to need automated and manual ways of fixing this so that
Web 3.0 looks like a step forward,  not a step back.
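
In the spirit of the earlier point about new predicates, that fix can be
materialized as data rather than buried in application code; a sketch with a
made-up displayName predicate (the choice of the dbpedia URI is likewise
illustrative):

PREFIX app: <http://example.org/myapp/>

# The long official name stays in the source data; the short form is what
# the dropdown renders.
INSERT DATA {
  <http://dbpedia.org/resource/North_Korea> app:displayName "North Korea" .
}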


Re: Making human-friendly linked data pages more human-friendly

2009-09-17 Thread Paul A Houle
On Thu, Sep 17, 2009 at 12:19 PM, Kingsley Idehen kide...@openlinksw.com wrote:

Schema Last vs. Schema First :-) An RDF virtue that once broadly understood,
 across the more traditional DBMS realms, will work wonders for RDF based
 Linked Data appreciation.


That's the conclusion that I'm coming to.

I've been thinking about the question: what would Cyc look like if it were
started today?

Cyc took the Schema First approach to the human memome project: as a
result it put a lot of work into upper and middle ontologies that don't
seem all that useful to many observers. Despite a great deal of effort put
into avoiding 'representational thorns', it got caught on them anyway.

A modern approach would be to start with a huge amount of data over various
domains and to construct schemas using a mix of statistical inference and
human input. The role of the upper ontology would be reduced here because,
in general, it isn't always necessary to mesh together two randomly chosen
domains -- say, bus schedules, anime, psychoanalysis, and particle physics.

Now, somebody might want to apply the system to study the relationship of
anime with psychoanalysis; that could be approached by constructing a
metatheory that is (i) based on those particular domains, and (ii)
conditioned by the application the system is being put to -- that is, kept
"on the bit," connected via a feedback loop to some means of evaluating the
system's progress towards a goal.

Representational Thorns get bypassed here because the system is free to
develop a new representation if an old one fails for a particular task.