Hi Patrick,
On 22.9.2010 21:01, Patrick van Kleef wrote:
Hi Florian,
I did some semantic web crawling and I gathered information on football
clubs from freebase and dbpedia.
In the dbpedia data, there is an individual that's equivalent
(owl:sameAs) http://dbpedia.org/resource/Chelsea_F.C. , namely
http://dbpedia.org/resource/Chelse_FC\
(note the backslash at the end of the URI)
When I use the sesame 2 API to access the virtuoso RDF store (directly
via the java repository), to collect all the owl:sameAs triples for
http://dbpedia.org/resource/Chelsea_F.C. like so
st = con.getStatements((Resource) ind, OWL.SAMEAS, null, true,
getContexts());
I get this error:
sparql select *
from named<http://myhost.com/ns/graph/linkeddata>
from named<http://myhost.com/ns/graph/soccer>
from named<http://myhost.com/ns/graph/tennis>
from named<http://myhost.com/ns/graph/entertainment-dbpedia>
from named<http://myhost.com/ns/common#inferred>
from named<http://myhost.com/ns/graph/dbpedia-ontology>
where {
graph ?g
{<http://dbpedia.org/resource/Chelse_FC\>
<http://www.w3.org/2002/07/owl#sameAs> ?o }
}
[virtuoso.jdbc3.VirtuosoException: SQ074: Line 1: syntax error at '\'
before 'SELECT']
I googled a bit and found these explanations:
http://docs.openlinksw.com/virtuoso/dbadm.html , specifically, under
6.1.9.1.9 [Client]:
SQL_NO_CHAR_C_ESCAPE=1
and gave it a shot (setting the conf var to 1), but the result was just
another error:
Could not open RepositoryConnection for transaction; nested
exception is org.openrdf.repository.RepositoryException:
virtuoso.jdbc3.VirtuosoException: Not using UTF-8 encoding of SQL
statements, but processing character escapes also disabled
I also tried setting
SQL_UTF8_EXECS = 1
in spite of reading in
http://docs.openlinksw.com/virtuoso/wideidentifiers.html that this may
make the whole content of my database unreadable - luckily, it didn't.
The error, however, persisted.
Do I just need to re-populate my database with the new SQL_UTF8_EXECS =
1 setting or is it something else?
You cannot use a plain \ character in a URI; you need to URL-encode it
like so:
sparql select *
from named<http://myhost.com/ns/graph/linkeddata>
from named<http://myhost.com/ns/graph/soccer>
from named<http://myhost.com/ns/graph/tennis>
from named<http://myhost.com/ns/graph/entertainment-dbpedia>
from named<http://myhost.com/ns/common#inferred>
from named<http://myhost.com/ns/graph/dbpedia-ontology>
where {
graph ?g
{<http://dbpedia.org/resource/Chelse_FC%5C>
<http://www.w3.org/2002/07/owl#sameAs> ?o }
}
Just as you would when entering a URL in your browser, for example:
http://dbpedia.org/page/Chelse_FC%5C
See also: http://en.wikipedia.org/wiki/Percent-encoding
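For illustration, a minimal sketch of that percent-encoding in plain Java, assuming the URI is available as a string. The class and method names are hypothetical and are not part of Sesame or the Virtuoso provider. It encodes only the characters that the SPARQL grammar does not allow between < and > (control characters, space, and < > " { } | ^ ` \), and leaves everything else untouched:

```java
/** Hypothetical helper for illustration; not part of Sesame or Virtuoso. */
public final class IriEscaper {
    // Characters the SPARQL grammar forbids inside <...>, in addition
    // to control characters and space (0x00-0x20).
    private static final String FORBIDDEN = "<>\"{}|^`\\";

    private IriEscaper() {}

    /** Percent-encodes the forbidden characters; other characters are copied as-is. */
    public static String escape(String iri) {
        StringBuilder out = new StringBuilder(iri.length() + 8);
        for (int i = 0; i < iri.length(); i++) {
            char c = iri.charAt(i);
            if (c <= 0x20 || FORBIDDEN.indexOf(c) >= 0) {
                // All affected characters are below 0x80, so two hex digits suffice.
                out.append('%');
                out.append(Character.toUpperCase(Character.forDigit(c >> 4, 16)));
                out.append(Character.toUpperCase(Character.forDigit(c & 0xF, 16)));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

With this sketch, IriEscaper.escape applied to the URI ending in a backslash yields the %5C form used in the query above.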
The triple containing the offending URI was crawled by virtuoso, comes
from freebase ( http://rdf.freebase.com/rdf/en.chelsea_fc ), and the URI
can be found on dbpedia as well, so in fact it's a dbpedia bug, correct?
(Btw, I assume that they already know about the problem from this mail
http://www.mail-archive.com/[email protected]/msg00561.html
)
However, shouldn't virtuoso's crawler reject such URIs when encountered
during the crawling process so as to keep such bugs from spreading into
virtuoso-based LOD applications?
As for your suggestion of percent-escaping the backslash: Honestly I
don't know how I'd do that.
In the line that causes the error,
st = con.getStatements((Resource) ind, OWL.SAMEAS, null, true,
getContexts());
ind is an object of class org.openrdf.model.URI and it holds the value
http://dbpedia.org/resource/Chelse_FC\ (I checked via the thread
debugger, the string ends with a single backslash); con is a
RepositoryConnection object obtained from the
virtuoso.sesame2.driver.VirtuosoRepository instance.
The URI is fetched from the repository in a previous execution of the
above statement, in a function that loads the transitive owl:sameAs
closure for an individual URI.
I actually thought that any necessary escaping would be handled by the
sesame library (or the virtuoso implementation of the sesame
repo/connection). Do you suggest that I check every URI that I intend to
use in con.getStatements(...) and percent-encode any offending
characters? Aside from performance considerations, would that even work?
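To make the question concrete, such a check could be a single pass over each URI string. A sketch in plain Java, with hypothetical names and covering only the characters that the SPARQL grammar forbids in an IRI reference; it does not answer whether the encoded form would still match the triple as stored:

```java
/** Hypothetical pre-check for illustration; not part of Sesame or Virtuoso. */
public final class IriCheck {
    // Characters the SPARQL grammar forbids inside <...>, besides 0x00-0x20.
    private static final String FORBIDDEN = "<>\"{}|^`\\";

    private IriCheck() {}

    /**
     * Returns the index of the first character that may not appear
     * literally inside <...> in a SPARQL query, or -1 if there is none.
     */
    public static int firstOffendingIndex(String iri) {
        for (int i = 0; i < iri.length(); i++) {
            char c = iri.charAt(i);
            if (c <= 0x20 || FORBIDDEN.indexOf(c) >= 0) {
                return i;
            }
        }
        return -1;
    }
}
```

A caller would only need to encode (or skip) a URI when the result is not -1, so clean URIs cost one scan and no allocation.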
Thanks,
Florian