> -----Original Message-----
> From: Peter Ansell [mailto:ansell.pe...@gmail.com]
> Sent: 22 November 2008 21:54
> To: Kingsley Idehen
> Cc: dbpedia-discuss...@lists.sourceforge.net; virtuoso-us...@lists.sourceforge.net
> Subject: Re: [Dbpedia-discussion] DBPedia 3.2 Load in Virtuoso 5.0.9 -
> Reporting on results, and some questions
>
> Duplicates!
>
> Can someone please explain this?
>
> As an aside, when I run this from isql on my newly locally installed dbpedia
> I get no duplicates (I haven't tried Jena with my local).
>
> <eom>
>
> Kingsley wrote:
> Marvin,
>
> You will see why when you run:
>
> select *
> where {graph ?g {
>   ?s
>   <http://dbpedia.org/property/influenced>
>   <http://dbpedia.org/resource/Chris_Rock>
> }}
>
> As you can see there are two graphs:
> 1. http://dbpedia.org
> 2. http://dbpedia.org/resource/<entity> (this one results from cache
>    activity associated with client interactions with Virtuoso)
>
> Solutions:
> -- Being specific about source Graph by specifying Graph IRI
> select ?s
> where {graph <http://dbpedia.org> {
>   ?s
>   <http://dbpedia.org/property/influenced>
>   <http://dbpedia.org/resource/Chris_Rock>
> }}
> OR
>
> select ?s
> from <http://dbpedia.org>
> where {
>   ?s
>   <http://dbpedia.org/property/influenced>
>   <http://dbpedia.org/resource/Chris_Rock>
> }
> -- Using DISTINCT
>
> select distinct ?s
> where {
>   ?s
>   <http://dbpedia.org/property/influenced>
>   <http://dbpedia.org/resource/Chris_Rock>
> }
> Peter wrote:
> What is the instruction to give with Jena/Other clients etc. to make it
> behave in the same way as the HTTP SPARQL page interface and not resolve
> triples from the cache graphs.

For Jena, when a call of:

    qexec = QueryExecutionFactory.sparqlService("http://DBpedia.org/sparql", q);

is made, the query is passed as-is to the SPARQL endpoint. The result set
comes back in the SPARQL results format and is parsed to produce the local
programming objects. There is no additional processing client-side.

Duplicates should not come back from that pattern, but the client-side code
does not check that the endpoint is functioning correctly.

In SPARQL, matching a basic graph pattern, or a triple pattern with one
variable, does not give duplicates because an RDF graph is a set of triples.
(It is only possible if the pattern includes a blank node - think of that as
a variable that is projected away and which, like a projection, can result in
duplicates across the narrower intermediate result).

If a union of other graphs underlies the virtual graph, then the compound
graph should still appear to be a set of statements, which will not produce
duplicates.

By just passing over the query as-is, there's an assumption the endpoint will
respect those semantics. Otherwise, it would require changing the query to
suppress duplicates, e.g. using DISTINCT.

In Jena this happens in quite a few places: we have union graphs, and the
inference engines would produce duplicates if they didn't suppress them. The
storage layers SDB and TDB [*] both support query over the union of named
graphs in an RDF dataset and both suppress duplicates that occur, to give the
set-of-triples view.

        Andy

[*] In the SVN only. It didn't make the last release.

>
> Cheers,
>
> Peter
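
P.S. A minimal sketch of the union-graph point above, in plain Python rather
than Jena. It is only an illustration under assumptions: the subjects
(ex:ComedianA, ex:ComedianB), the graph name ex:cache-graph and the helper
functions are all made up, and this is not how Virtuoso, SDB or TDB are
implemented. It shows that matching each named graph separately and
concatenating the rows leaks duplicates, whereas building the union as a set
of triples first gives the set-of-triples view.

```python
# Hypothetical data: subjects and the cache graph name are invented for
# illustration; only the predicate and object IRIs come from the thread.
INFLUENCED = "http://dbpedia.org/property/influenced"
CHRIS_ROCK = "http://dbpedia.org/resource/Chris_Rock"

# Two named graphs that share one statement, like a main graph plus a
# cache graph holding a copy of some of its triples.
named_graphs = {
    "http://dbpedia.org": {
        ("ex:ComedianA", INFLUENCED, CHRIS_ROCK),
        ("ex:ComedianB", INFLUENCED, CHRIS_ROCK),
    },
    "ex:cache-graph": {
        ("ex:ComedianA", INFLUENCED, CHRIS_ROCK),
    },
}


def match_subjects(triples, predicate, obj):
    """Return ?s for every triple matching the pattern (?s predicate obj)."""
    return [s for (s, p, o) in triples if p == predicate and o == obj]


def naive_union_match(graphs, predicate, obj):
    """Match inside each graph and concatenate: shared triples repeat."""
    rows = []
    for triples in graphs.values():
        rows.extend(match_subjects(triples, predicate, obj))
    return rows


def set_union_match(graphs, predicate, obj):
    """Form the union as a set of triples first, then match once."""
    union = set().union(*graphs.values())
    return match_subjects(union, predicate, obj)


if __name__ == "__main__":
    print(sorted(naive_union_match(named_graphs, INFLUENCED, CHRIS_ROCK)))
    print(sorted(set_union_match(named_graphs, INFLUENCED, CHRIS_ROCK)))
```

The first function behaves like an endpoint that exposes the cache graph
alongside the main graph; the second is the behaviour a client passing the
query over as-is is assuming.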