Hi Arne,

I was asked by Hugh Williams and Jens Lehmann to send an email to this mailing list regarding the unstable performance of the DBpedia SPARQL endpoint. I’m currently working on a project for the EU that requires me to use LOD to retrieve information about cities and Eurostat data. For the information about the cities I’m relying on the DBpedia project.

The problem is that the endpoint is behaving in a weird fashion at the moment. At certain times, I receive information reasonably fast from the endpoint, but in most cases the endpoint either times out or doesn’t return any data at all.

I assume your code does check the HTTP status result codes?

There are a number of reasons why the DBpedia endpoint can give timeouts, such as:

1. you send too many requests in a very short amount of time
2. you send very time-consuming queries and you receive a timeout
3. someone else is running some very expensive queries

We have several ACLs in place to deal with the above scenarios, but that does not stop some users from writing crawlers that ignore the HTTP status codes instead of acting on them properly. We are looking into the best way to deal with such crawlers.
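As a rough sketch of what acting on the status codes could look like in Ruby: the helper names retry_delay and polite_get are made up for this example, and treating 429 and 5xx as "wait and retry" is my assumption rather than a documented list of what the endpoint returns.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: seconds to wait before retrying, or nil when the
# request should not be retried. Which codes count as retryable is an
# assumption made for this sketch.
def retry_delay(status, attempt, base: 1.0, max: 60.0)
  case status
  when 200..299
    nil                              # success, nothing to retry
  when 429, 500..599
    [base * (2**attempt), max].min   # exponential backoff, capped at max
  else
    nil                              # other client errors: fix the request instead
  end
end

# Hypothetical helper: GET a URL and back off between attempts instead of
# immediately re-sending the same request. Network timeouts raised by
# Net::HTTP are not rescued here.
def polite_get(url, max_attempts: 4)
  uri = URI(url)
  max_attempts.times do |attempt|
    response = Net::HTTP.get_response(uri)
    delay = retry_delay(response.code.to_i, attempt)
    return response if delay.nil?
    sleep(delay) if attempt < max_attempts - 1
  end
  nil # every attempt was answered with a retryable status
end
```

The point is only that a non-2xx answer should slow the client down rather than trigger an immediate retry.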

A few examples of cities that I’m working with can be located here:

-        http://dbpedia.org/resource/Antwerp
-        http://dbpedia.org/resource/Düsseldorf
-        http://dbpedia.org/resource/London

From the datasets, I’m able to deduce that I have a thumbnail property and an abstract property, and this is the information that I require in the project. Using Ruby on Rails, I have created some classes that allow me to communicate with the SPARQL endpoint located at http://dbpedia.org/sparql
Using a simple GET request I send the following query to the endpoint:

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX prop: <http://dbpedia.org/property/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?thumbnail ?abstract
WHERE {
  ?location rdfs:label ?label;
            a dbo:PopulatedPlace
  OPTIONAL { ?location dbo:thumbnail ?thumbnail . }
  OPTIONAL { ?location dbo:abstract ?abstract . FILTER langMatches(lang(?abstract), 'en') }
  FILTER regex(?label, 'Swanage', 'i')
} LIMIT 1

Here the language and the name of the town are injected by the code. In some cases this returns information; in other cases it returns nothing. In the case of multiple results I’m only interested in the first match. I’m using the OPTIONAL parts because I don’t know if the properties are always there; it’s just a safeguard to prevent exceptions on my side.

If I’m doing something wrong with the query, please let me know and I’ll change it, but my understanding is that the endpoint is not working as intended.


The performance killer in your query is the use of:

  FILTER regex(?label, 'Swanage', 'i')

which basically has to check every individual place triple to see if it matches, and that is not a very efficient way to get results quickly.


I suggest you replace this FILTER line with:

        ?label bif:contains '\'Düsseldorf\''

or

        ?label bif:contains " 'Antwerpen' "


which takes into account the fact that there can be special characters in names. If you are dealing only with Latin1 characters, you can leave out the inner quotes. Also note that the spaces between the double and single quotes are not needed, but they enhance readability.

The CONTAINS function has the same advantage as the REGEX (x, 'i') option, finding antwerpen, Antwerpen or any other mix of case, but it uses an index, which is much faster.
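Since you inject the town name from code, here is a minimal Ruby sketch of building that bif:contains object safely. The helper name bif_contains_literal is hypothetical, and replacing quotes and backslashes with spaces is just a conservative choice for this example, not a rule taken from the Virtuoso documentation.

```ruby
# Hypothetical helper: turn a place name into the object of a bif:contains
# pattern, i.e. a double-quoted SPARQL string wrapping a single-quoted phrase.
def bif_contains_literal(name)
  # Replace characters that could end either quoting level or start an escape,
  # then collapse the whitespace this leaves behind.
  phrase = name.gsub(/["'\\]/, ' ').strip.squeeze(' ')
  raise ArgumentError, 'empty search phrase' if phrase.empty?
  %("'#{phrase}'")
end

puts "?label bif:contains #{bif_contains_literal('Düsseldorf')} ."
```

Keeping the inner single quotes in all cases means the same code path works for names with and without special characters.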

So this would mean your query would look like:

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX prop: <http://dbpedia.org/property/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?thumbnail ?abstract
WHERE {
    ?location rdfs:label ?label ;
              a dbo:PopulatedPlace .
    ?label bif:contains " 'Antwerpen' " .
    OPTIONAL { ?location dbo:thumbnail ?thumbnail . }
    OPTIONAL { ?location dbo:abstract ?abstract . FILTER langMatches(lang(?abstract), 'en') }
} LIMIT 1


which is much faster.


Another trick you can use is to turn a SPARQL request into an ANYTIME query. This is done by adding
        
        &timeout=5000

to the end of the /sparql/?query=XXXX request, which instructs the Virtuoso SPARQL endpoint to return only the results it can find in approximately 5000 msec. There are special HTTP response header flags that indicate whether the result set is partial or full.

See also:
http://docs.openlinksw.com/virtuoso/anytimequeries.html
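A small Ruby sketch of how the timeout parameter could be added on your side. The helper names sparql_url and partial_result? are made up for this example, and the header name X-SQL-State is my assumption about how a partial result is flagged; please check the page above for the exact header names and values.

```ruby
require 'uri'

# Hypothetical helper: build the GET URL for a query, with the anytime
# timeout given in milliseconds.
def sparql_url(query, timeout_ms: 5000, endpoint: 'http://dbpedia.org/sparql')
  uri = URI(endpoint)
  uri.query = URI.encode_www_form(query: query, timeout: timeout_ms)
  uri.to_s
end

# Hypothetical helper. Assumption: a partial result is signalled by the
# presence of an X-SQL-State response header (names compared without case).
def partial_result?(headers)
  headers.any? { |name, _value| name.to_s.casecmp?('X-SQL-State') }
end

puts sparql_url('SELECT * WHERE { ?s ?p ?o } LIMIT 1')
```

With Net::HTTP, the headers of a response can be passed in as response.to_hash.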


Patrick


