Re: [Wikitech-l] Will the real URI stand up? [dbpedia vs wikipedia vs the world]

Brion Vibber Wed, 13 Oct 2010 15:15:29 -0700

On Wed, Oct 13, 2010 at 2:29 PM, Paul Houle wrote:
[snip]

> Now the problem I've got with the Ookaboo API is that I know people are
> going to punch in
>
> http://wikipedia.org/page/Boston,_MA
>
> and I need to turn this into the right dbpedia URL.  My plan for dealing
> with this is to
>
> (i) store the exact URI I get out of dbpedia,
> (ii) always give people the exact URI out of dbpedia (if I publish RDFa
> or JSON data),
> (iii) give the same URI for wikipedia that dbpedia gives (in HTML,
> RDFa,  etc.)
> (iv) if I get a query,  apply the same canonicalization rules that
> dbpedia uses...
>
> Which begs the question of what exactly those rules are.  What are they?
>


It sounds like you need to map from "URL which contains an English Wikipedia
article title" to "URI identifier for the DBPedia node describing the
concept which that article is about".

The good news is that you can probably get away without caring too much
about the actual encoding in the source Wikipedia URL you're looking at. :)

DBPedia's documentation says that their resource URIs are "of the form
http://dbpedia.org/resource/Name, where Name is taken from the URL of the
source Wikipedia article, which has the form
http://en.wikipedia.org/wiki/Name. Thus, each resource is tied directly
to an English-language Wikipedia article." --
http://wiki.dbpedia.org/Datasets#h18-4

They may or may not actually mean that with regard to normalization of
%-encoding... let's assume that they do indeed copy it exactly.

My off-the-cuff recommendation might be something like this:

1) Resolve the redirect:
  a) fetch the URL, following any HTTP redirects -- this will let you avoid
worrying about domain aliases you don't recognize, etc
  b) grab the <link rel="canonical"> url if any -- this will resolve any
in-wiki redirects for you

2) Get the article name!
  a) confirm the URL is in the format you expect:
http://en.wikipedia.org/wiki/(.*)
  b) divide the title from the rest of the URL

In a sane world, you'd unescape the %-encoding here, replace underscores with
spaces, then take the UTF-8 name and run it through whatever DBPedia specifies
as their encoding style. But if they're just copying the URL fragments
directly from Wikipedia, you can just take the string as-is. :)

3) Prepend http://dbpedia.org/resource/ to the URL fragment.
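
A rough Python sketch of steps 1b through 3, under the same assumption as
above (that DBPedia copies the title portion of the Wikipedia URL exactly).
The function and class names are made up for illustration, the fetch in step
1a is left to whatever HTTP client you use, and accepting https as well as
http is my own addition:

```python
import re
from html.parser import HTMLParser
from urllib.parse import unquote

# Step 2a: the URL format we expect (https allowed here as an assumption).
WIKI_PATTERN = re.compile(r"^https?://en\.wikipedia\.org/wiki/(.+)$")
DBPEDIA_PREFIX = "http://dbpedia.org/resource/"


class CanonicalLinkParser(HTMLParser):
    """Remembers the href of the first <link rel="canonical"> tag, if any."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag == "link" and self.canonical is None:
            attr = dict(attrs)
            if (attr.get("rel") or "").lower() == "canonical":
                self.canonical = attr.get("href")


def canonical_url(html, fetched_url):
    """Step 1b: prefer the page's canonical link, else the URL we fetched."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    return parser.canonical or fetched_url


def dbpedia_uri(wikipedia_url):
    """Steps 2 and 3: split off the title and prepend the DBPedia prefix."""
    match = WIKI_PATTERN.match(wikipedia_url)
    if match is None:
        raise ValueError("not an English Wikipedia article URL: " + wikipedia_url)
    # Assumption: the title is copied verbatim, %-encoding and all.
    return DBPEDIA_PREFIX + match.group(1)


def readable_title(url_title):
    """The 'sane world' decoding: undo %-encoding, underscores to spaces."""
    return unquote(url_title).replace("_", " ")


if __name__ == "__main__":
    # Hand-written sample markup, not a real Wikipedia response.
    sample = ('<html><head><link rel="canonical" '
              'href="http://en.wikipedia.org/wiki/Boston" /></head></html>')
    url = canonical_url(sample, "http://en.wikipedia.org/wiki/Boston,_MA")
    print(dbpedia_uri(url))
```

If DBPedia turns out to normalize the %-encoding after all, the place to
change is the one line in dbpedia_uri() that copies the title verbatim.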

Alternately, it wouldn't surprise me if DBPedia contained metadata or a
search helper to look up by Wikipedia article name, but I can't get any of
the SPARQL examples I've found on the web to work on their online lookup
just now so I'm too lazy to go looking further. ;)

-- brion
_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
