Comments inline:
On 10/01/2017 21:56, "Ganesh Selvaraj" <[email protected]> wrote:
Thank you.
Now I have loaded the data using the method tdbloader.main(....), and it has
created the index and stats for me.
I have a query which I am executing, and I feel like the optimizer is not
optimising the query. Can you advise me whether I am using it the right way?
The optimiser is enabled by default. However, the optimiser does not magically
make every query fast! Some queries are just fundamentally hard, particularly
when applied to large datasets. You appear to be using LUBM, but you haven’t
mentioned what scale factor you are using.
Specifically, this appears to be Query 2 rather than Query 1, as the code would
imply. Query 2 finds triangles in the data, which is a large and complex join
that does not scale well.
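
To illustrate why the triangle shape is expensive, here is a rough in-memory
sketch. It is not how TDB evaluates the query; the class name, the maps and
the simplification that each student has one department and each department
one university are all assumptions made for illustration. The point is that
the first two patterns produce one intermediate row per student before the
third pattern gets a chance to discard anything.

```java
import java.util.Map;
import java.util.Set;

public class TriangleJoinSketch {

    /**
     * Joins memberOf with subOrganizationOf, then checks the closing edge
     * (undergraduateDegreeFrom). Returns {intermediateRows, finalRows}.
     * Hypothetical helper, not part of Jena.
     */
    public static long[] countTriangles(Map<String, String> memberOf,
                                        Map<String, String> subOrganizationOf,
                                        Map<String, Set<String>> degreeFrom) {
        long intermediate = 0;
        long matches = 0;
        for (Map.Entry<String, String> e : memberOf.entrySet()) {
            String student = e.getKey();
            String department = e.getValue();
            String university = subOrganizationOf.get(department);
            if (university == null) {
                continue; // department has no parent organisation
            }
            intermediate++; // one (student, department, university) row so far
            Set<String> degrees = degreeFrom.get(student);
            if (degrees != null && degrees.contains(university)) {
                matches++; // the third edge closes the triangle
            }
        }
        return new long[] { intermediate, matches };
    }
}
```

On real data the intermediate count grows with the number of students, while
the number of closed triangles stays comparatively small.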
For queries like this, which are fundamentally hard, the optimiser and
statistics will only make so much difference. On this kind of query you are
better off throwing more computing resources, i.e. memory, at it. Note that
most memory usage for TDB is off heap, so setting the heap size too high will
actually reduce performance.
Also, if you’re comparing this query's performance to that of other queries,
please bear in mind that most of the LUBM queries require inference to answer.
Queries 4 through 13 cannot be answered without inference and so will give
very fast but incorrect (i.e. empty) answers regardless of the scale of the
dataset they are applied to.
This is the method;
public void testLUBMQuery1_original() {
long duration = 0l;
Date startTime, endTime;
startTime = new Date();
String sparqlQueryString = "PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#> "
+ "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> "
+ "SELECT ?X ?Y ?Z WHERE { "
+ "?Z ub:subOrganizationOf ?Y. "
+ "?Y rdf:type ub:University. "
+ "?Z rdf:type ub:Department. "
+ "?X ub:memberOf ?Z. "
+ "?X rdf:type ub:GraduateStudent. "
+ "?X ub:undergraduateDegreeFrom ?Y. }";
Query query = QueryFactory.create(sparqlQueryString);
//Tried with and without the Algebra step
Op op = Algebra.compile(query) ;
This step happens automatically; there is no need to do it yourself.
query = OpAsQuery.asQuery(op);
This step is entirely unnecessary.
QueryExecution qexec = QueryExecutionFactory.create(query, dataset);
ResultSet results = qexec.execSelect();
ResultSetFormatter.out(results);
endTime = new Date();
What are you actually trying to time? Since you write out the results, you're
timing both the time to execute the query and the time to format the results.
At larger scales the time to format can be huge.
More specifically, you are asking for text output, which creates an ASCII
table, and that requires multiple passes over the data in order to determine
the column widths.
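
As a rough illustration of why a padded text table needs more than one pass,
here is a minimal standalone sketch. It is not Jena's formatter; the class
and method names are made up for this example. All rows have to be held and
scanned once to find each column's width before the first line can be
written.

```java
import java.util.List;

public class AsciiTableSketch {

    /**
     * Renders rows as a padded text table. Assumes every row has the same
     * number of cells. Hypothetical helper, not part of Jena.
     */
    public static String render(List<String[]> rows) {
        if (rows.isEmpty()) {
            return "";
        }
        int cols = rows.get(0).length;
        int[] widths = new int[cols];
        // Pass 1: scan every row to find the widest cell in each column.
        for (String[] row : rows) {
            for (int c = 0; c < cols; c++) {
                widths[c] = Math.max(widths[c], row[c].length());
            }
        }
        // Pass 2: only now can each cell be padded and written out.
        StringBuilder sb = new StringBuilder();
        for (String[] row : rows) {
            sb.append('|');
            for (int c = 0; c < cols; c++) {
                sb.append(' ').append(row[c]);
                for (int pad = row[c].length(); pad < widths[c]; pad++) {
                    sb.append(' ');
                }
                sb.append(" |");
            }
            sb.append('\n');
        }
        return sb.toString();
    }
}
```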
I would suggest that you simply iterate over the result set to consume it (you
can use ResultSetFormatter.consume() for this) if you want to time just the
query execution.
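
A minimal sketch of that timing pattern follows, written against plain Java
iterators so it runs without Jena on the classpath. The class and method
names are hypothetical. With Jena, the supplier would return
qexec.execSelect() (a ResultSet is an Iterator of QuerySolution), and the
drain loop does the same job as ResultSetFormatter.consume(results).

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.function.Supplier;

public class QueryTiming {

    /** Row count and elapsed time of one timed run. */
    public static final class Timed {
        public final long rows;
        public final long nanos;

        Timed(long rows, long nanos) {
            this.rows = rows;
            this.nanos = nanos;
        }
    }

    /**
     * Starts the clock, obtains the results, drains them without printing
     * anything, then stops the clock. Formatting cost is left out on purpose.
     */
    public static <T> Timed timeAndConsume(Supplier<Iterator<T>> execute) {
        long start = System.nanoTime();
        Iterator<T> results = execute.get();
        long rows = 0;
        while (results.hasNext()) {
            results.next(); // force each row to be produced
            rows++;
        }
        long elapsed = System.nanoTime() - start;
        return new Timed(rows, elapsed);
    }

    public static void main(String[] args) {
        // Stand-in for real query results, purely for demonstration.
        Timed t = timeAndConsume(() -> Arrays.asList("a", "b", "c").iterator());
        System.out.println(t.rows + " rows in " + (t.nanos / 1_000_000) + " ms");
    }
}
```

System.nanoTime() is used instead of new Date() because it is intended for
measuring elapsed time. A single run still includes JVM warm-up and cold
caches, so repeat the measurement several times before drawing conclusions.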
If you are interested, there are tools for benchmarking that will give you
much more accurate and reliable results:
https://github.com/rvesse/sparql-query-bm
Rob
duration = endTime.getTime() - startTime.getTime();
System.out.println(query.toString());
System.out.println("Original Query 1 Duration: " + duration );
}
Thanks Again.
Best,
Ganesh
On 10 January 2017 at 11:37, Andy Seaborne <[email protected]> wrote:
>
>
> On 09/01/17 19:40, A. Soroka wrote:
>
>> The layout of the statistics file is documented here:
>>
>> https://jena.apache.org/documentation/tdb/optimizer.html#statistics-rule-file
>>
>> tdbloader and tdbloader2 are the CLI utilities for building TDB
>> databases, but they are written in Java and can be used in Java.
>>
>> https://jena.apache.org/documentation/tdb/commands.html#tdbloader
>>
>>
> Jena is open source and Maven Central has source artifacts that your IDE
> will automatically attach to your projects.
>
> See the package:
> org.apache.jena.tdb.solver.stats;
>
> Andy
>
>
> ---
>> A. Soroka
>> The University of Virginia Library
>>
>> On Jan 9, 2017, at 2:36 PM, Ganesh Selvaraj <[email protected]>
>>> wrote:
>>>
>>> Hi All,
>>>
>>> I am using Jena TDB for my work. So far I could not find much
>>> documentation
>>> on data indexing and statistics building for Jena TDB.
>>>
>>> I would prefer doing it via a Java API.
>>>
>>> Any help/documentation is appreciated.
>>>
>>> Thanks
>>> Ganesh
>>>
>>
>>