Re: Jena TDB indexing and stats building

Rob Vesse Wed, 11 Jan 2017 01:02:43 -0800

Comments in line:

On 10/01/2017 21:56, "Ganesh Selvaraj" <[email protected]> wrote:

    Thank you.
    Now I have loaded the data using the method tdbloader.main(....), and it
    has created the index and stats for me.

    I have a query which I am executing, and I feel like the optimizer is not
    optimising the query. Can you advise me if I am using it the right way?

The optimiser is enabled by default. However, the optimiser does not magically
make every query fast! Some queries are just fundamentally hard, particularly
when applied to large datasets. You appear to be using LUBM, but you haven’t
mentioned what scale factor you use.

Specifically, this appears to be Query 2 rather than Query 1 as the code would
imply. Query 2 finds triangles in the data, which is a large and complex join
that does not scale well.

For queries like this, which are fundamentally hard, the optimiser and
statistics will only make so much difference. On this kind of query you are
better off throwing more computing resources, i.e. memory, at it. Note that
most memory usage for TDB is off heap, so setting the heap size too high will
actually reduce performance.

Also, if you’re comparing this query’s performance to that of other queries,
please bear in mind that most of the LUBM queries require inference to answer.
Queries 4 through 13 cannot be answered without inference and so will give very
fast but incorrect (i.e. empty) answers regardless of the scale of the dataset
they are applied to.

    This is the method;

    public void testLUBMQuery1_original() {

    long duration = 0l;

    Date startTime, endTime;

    startTime = new Date();

    String sparqlQueryString = "PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#> "

    + "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> "

    + "SELECT ?X ?Y ?Z WHERE { "

    + "?Z ub:subOrganizationOf ?Y. "

    + "?Y rdf:type ub:University. "

    + "?Z rdf:type ub:Department. "

    + "?X ub:memberOf ?Z. "

    + "?X rdf:type ub:GraduateStudent. "

    + "?X ub:undergraduateDegreeFrom ?Y. }";

    Query query = QueryFactory.create(sparqlQueryString);

    //Tried with and without the Algebra step

    Op op = Algebra.compile(query) ;

This step happens automatically; there is no need to do it yourself.

    query = OpAsQuery.asQuery(op);

This step is entirely unnecessary.

    QueryExecution qexec = QueryExecutionFactory.create(query, dataset);

    ResultSet results = qexec.execSelect();

    ResultSetFormatter.out(results);

    endTime = new Date();

What are you actually trying to time? Since you write out the results, you’re
timing both the time to execute the query and the time to format the results.
At larger scales the time to format can be huge.

More specifically, you are asking for text output. That creates an ASCII table,
which requires multiple passes over the data in order to determine the column
sizes.

I would suggest that you simply iterate over the result set to consume it (you
can use ResultSetFormatter.consume() for this) if you want to time just the
query execution.
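
As a rough illustration of timing consumption only, here is a minimal sketch
that uses nothing but the Java standard library. QueryTimer, consume() and
timeNanos() are hypothetical names made up for this example and are not part of
Jena. The list of strings stands in for a real result set. With Jena, assuming
a Jena ResultSet can be treated as an Iterator over its rows, you would pass
qexec::execSelect as the supplier, or simply call ResultSetFormatter.consume()
between your two timestamps.

```java
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical helper (not part of Jena): times a query without formatting its results. */
final class QueryTimer {

    /** Walks every row and discards it; returns the number of rows seen. */
    static <T> long consume(Iterator<T> rows) {
        long count = 0;
        while (rows.hasNext()) {
            rows.next();
            count++;
        }
        return count;
    }

    /** Runs the query, drains its results, and returns the elapsed time in nanoseconds. */
    static <T> long timeNanos(Supplier<Iterator<T>> query) {
        long start = System.nanoTime();
        consume(query.get());
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        // Stand-in data; a real run would supply the query's result iterator instead.
        List<String> fakeRows = List.of("row1", "row2", "row3");
        long nanos = timeNanos(fakeRows::iterator);
        System.out.println("Results consumed in " + nanos + " ns");
    }
}
```

System.nanoTime() is used here rather than Date because it is monotonic, which
matters for short-running queries. A single run is still a poor measurement, so
treat any number from a harness like this as indicative only.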

There are tools for benchmarking if you are interested that will give you much 
more accurate and reliable results:

https://github.com/rvesse/sparql-query-bm

Rob

    duration = endTime.getTime() - startTime.getTime();

    System.out.println(query.toString());

    System.out.println("Original Query 1 Duration: " + duration );

    }

    Thanks Again.

    Best,

    Ganesh

    On 10 January 2017 at 11:37, Andy Seaborne <[email protected]> wrote:

    >
    >
    > On 09/01/17 19:40, A. Soroka wrote:
    >
    >> The layout of the statistics file is documented here:
    >>
    >> https://jena.apache.org/documentation/tdb/optimizer.html#statistics-rule-file
    >>
    >> tdbloader and tdbloader2 are the CLI utilities for building TDB
    >> databases, but they are written in Java and can be used in Java.
    >>
    >> https://jena.apache.org/documentation/tdb/commands.html#tdbloader
    >>
    >>
    > Jena is open source, and Maven Central has source artifacts that your
    > IDE will automatically attach to your projects.
    >
    > See the package:
    > org.apache.jena.tdb.solver.stats;
    >
    >         Andy
    >
    >
    > ---
    >> A. Soroka
    >> The University of Virginia Library
    >>
    >> On Jan 9, 2017, at 2:36 PM, Ganesh Selvaraj <[email protected]>
    >>> wrote:
    >>>
    >>> Hi All,
    >>>
    >>> I am using Jena TDB for my work. So far I could not find much
    >>> documentation
    >>> on data indexing and statistics building for Jena TDB.
    >>>
    >>> I would prefer doing it via a Java API.
    >>>
    >>> Any help/documentation is appreciated.
    >>>
    >>> Thanks
    >>> Ganesh
    >>>
    >>
    >>
