Hi Patrick,

On 22.9.2010 21:01, Patrick van Kleef wrote:
Hi Florian,

I did some semantic web crawling and I gathered information on football
clubs from freebase and dbpedia.
In the dbpedia data, there is an individual that's equivalent
(owl:sameAs) http://dbpedia.org/resource/Chelsea_F.C. , namely
http://dbpedia.org/resource/Chelse_FC\
(note the backslash at the end of the URI)

When I use the sesame 2 API to access the virtuoso RDF store (directly
via the java repository), to collect all the owl:sameAs triples for
http://dbpedia.org/resource/Chelsea_F.C. like so

st = con.getStatements((Resource) ind, OWL.SAMEAS, null, true,
getContexts());

I get this error:

sparql select *
from named<http://myhost.com/ns/graph/linkeddata>
from named<http://myhost.com/ns/graph/soccer>
from named<http://myhost.com/ns/graph/tennis>
from named<http://myhost.com/ns/graph/entertainment-dbpedia>
from named<http://myhost.com/ns/common#inferred>
from named<http://myhost.com/ns/graph/dbpedia-ontology>
where {
  graph ?g
{<http://dbpedia.org/resource/Chelse_FC\> <http://www.w3.org/2002/07/owl#sameAs> ?o }
  }
[virtuoso.jdbc3.VirtuosoException: SQ074: Line 1: syntax error at '\' before 'SELECT']

I googled a bit and found these explanations:
http://docs.openlinksw.com/virtuoso/dbadm.html , specifically, under
6.1.9.1.9 [Client]:

SQL_NO_CHAR_C_ESCAPE=1

and gave it a shot (setting the conf var to 1), but the result was just
another error:

Could not open RepositoryConnection for transaction; nested exception is org.openrdf.repository.RepositoryException: virtuoso.jdbc3.VirtuosoException: Not using UTF-8 encoding of SQL statements, but processing character escapes also disabled


I also tried setting
SQL_UTF8_EXECS = 1
in spite of reading in
http://docs.openlinksw.com/virtuoso/wideidentifiers.html that this may
make the whole content of my database unreadable - luckily, it didn't.
The error, however, persisted.

Do I just need to re-populate my database with the new SQL_UTF8_EXECS =
1 setting or is it something else?

You cannot use a plain \ character in a URI; you need to URL-encode it, like so:

sparql select *
from named<http://myhost.com/ns/graph/linkeddata>
from named<http://myhost.com/ns/graph/soccer>
from named<http://myhost.com/ns/graph/tennis>
from named<http://myhost.com/ns/graph/entertainment-dbpedia>
from named<http://myhost.com/ns/common#inferred>
from named<http://myhost.com/ns/graph/dbpedia-ontology>
where {
  graph ?g
{<http://dbpedia.org/resource/Chelse_FC%5C> <http://www.w3.org/2002/07/owl#sameAs> ?o }
  }

Just as you would when entering a URL in your browser:

   http://dbpedia.org/page/Chelse_FC%5C


See also: http://en.wikipedia.org/wiki/Percent-encoding
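The %5C in the query above is simply the backslash's byte value (0x5C) written as two hex digits after a percent sign. A minimal Java sketch of that encoding follows; the class name PercentEncode and its helper are hypothetical (not part of Sesame or Virtuoso), and UTF-8 is assumed as the byte encoding:

```java
import java.nio.charset.StandardCharsets;

public class PercentEncode {
    // Hypothetical helper: percent-encode each UTF-8 byte of one character
    // as '%' followed by two uppercase hex digits.
    static String encodeChar(char c) {
        StringBuilder sb = new StringBuilder();
        for (byte b : String.valueOf(c).getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%%%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The URI from this thread, ending in a single backslash.
        String raw = "http://dbpedia.org/resource/Chelse_FC\\";
        // Replace every literal backslash with its percent-encoded form.
        String escaped = raw.replace("\\", encodeChar('\\'));
        System.out.println(escaped);
    }
}
```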


The triple containing the offending URI was crawled by virtuoso, comes from freebase ( http://rdf.freebase.com/rdf/en.chelsea_fc ), and the URI can be found on dbpedia as well, so in fact it's a dbpedia bug, correct? (Btw, I assume that they already know about the problem from this mail http://www.mail-archive.com/[email protected]/msg00561.html )

However, shouldn't virtuoso's crawler reject such URIs when encountered during the crawling process so as to keep such bugs from spreading into virtuoso-based LOD applications?

As for your suggestion of percent-escaping the backslash: Honestly I don't know how I'd do that.

In the line that causes the error,
st = con.getStatements((Resource) ind, OWL.SAMEAS, null, true,
getContexts());
ind is an object of class org.openrdf.model.URI and it holds the value http://dbpedia.org/resource/Chelse_FC\ (I checked via the thread debugger; the string ends with a single backslash). con is a RepositoryConnection object obtained from the virtuoso.sesame2.driver.VirtuosoRepository instance. The URI was fetched from the repository by a previous execution of the above statement, in a function that loads the transitive owl:sameAs closure for an individual URI.

I actually thought that any necessary escaping would be handled by the sesame library (or the virtuoso implementation of the sesame repo/connection). Do you suggest that I check every URI that I intend to use in con.getStatements(...) and percent-encode any offending characters? Aside from performance considerations, would that even work?
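To make the question concrete, here is a rough sketch of what such a per-URI check could look like. The class IriGuard and its methods are hypothetical names, not part of Sesame or the Virtuoso driver. The character set is the one the SPARQL grammar excludes from <...> IRI references (<, >, ", {, }, |, ^, `, \ and anything up to U+0020). Strings without offending characters are returned unchanged, so the common case costs one scan and no allocation. Whether the store then matches the triple depends on whether it holds the URI with a raw backslash or with %5C, which this sketch does not settle.

```java
public class IriGuard {
    // Characters the SPARQL grammar does not allow inside <...>,
    // in addition to control characters and space (U+0000..U+0020).
    private static final String FORBIDDEN = "<>\"{}|^`\\";

    static boolean isForbidden(char c) {
        return c <= 0x20 || FORBIDDEN.indexOf(c) >= 0;
    }

    // Returns the input unchanged if nothing needs escaping; otherwise a
    // copy with each forbidden character percent-encoded. All forbidden
    // characters are ASCII, so one byte (two hex digits) is enough.
    static String escape(String iri) {
        int i = 0;
        while (i < iri.length() && !isForbidden(iri.charAt(i))) {
            i++;
        }
        if (i == iri.length()) {
            return iri; // fast path: no offending character found
        }
        StringBuilder sb = new StringBuilder(iri.length() + 8);
        sb.append(iri, 0, i);
        for (; i < iri.length(); i++) {
            char c = iri.charAt(i);
            if (isForbidden(c)) {
                sb.append('%');
                sb.append(Character.toUpperCase(Character.forDigit(c >> 4, 16)));
                sb.append(Character.toUpperCase(Character.forDigit(c & 0xF, 16)));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("http://dbpedia.org/resource/Chelse_FC\\"));
    }
}
```

With Sesame 2 this would presumably be applied as con.getValueFactory().createURI(IriGuard.escape(ind.stringValue())) before the call to con.getStatements(...).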


Thanks,
Florian


