Thanks Rob. I will move the result formatting outside the time measurement
logic. The dataset is not huge, just approximately 1 million triples. Jena TDB
answered this query in about 2 seconds, but when I printed the query to
see the new optimised join order it was the same as the original input
query. From your answer I now understand that this is expected behaviour.
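
For what it's worth, my understanding now is that TDB applies its
statistics-based reordering inside the engine at execution time, so the
printed Query keeps the written join order. A minimal sketch of what can
actually be inspected from the algebra (the class name and query are
illustrative; assumes apache-jena on the classpath):

```java
import org.apache.jena.query.Query;
import org.apache.jena.query.QueryFactory;
import org.apache.jena.sparql.algebra.Algebra;
import org.apache.jena.sparql.algebra.Op;

public class ShowAlgebra {
    public static void main(String[] args) {
        String sparql = "SELECT ?s WHERE { ?s a ?t . ?s ?p ?o . }";
        Query query = QueryFactory.create(sparql);

        // The compiled algebra mirrors the query as written...
        Op op = Algebra.compile(query);
        System.out.println(op);

        // ...while Algebra.optimize applies the general ARQ rewrites.
        // TDB's statistics-based join reordering is a separate step that
        // happens inside the engine at execution time, so neither form
        // shows it.
        System.out.println(Algebra.optimize(op));
    }
}
```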

Regarding the LUBM queries: these are different from the ones on the LUBM
site; they were published by openRDF.

Best,
Ganesh

On 11 January 2017 at 22:01, Rob Vesse <[email protected]> wrote:

> Comments in line:
>
> On 10/01/2017 21:56, "Ganesh Selvaraj" <[email protected]> wrote:
>
>     Thank you.
>     Now I have loaded the data using the method tdbloader.main(....), and
>     it has created the indexes and statistics for me.
>
>     I have a query which I am executing, and I feel like the optimizer is
>     not optimising it. Can you advise me on whether I am using it the
>     right way?
>
> The optimiser is enabled by default. However, the optimiser does not
> magically make every query fast! Some queries are just fundamentally hard,
> particularly when applied to large datasets. You appear to be using LUBM
> but you haven’t mentioned what scale factor you used.
>
> Specifically, this appears to be Query 2 rather than Query 1 as the code
> would imply. Query 2 finds triangles in the data, which is a large and
> complex join that does not scale well.
>
>  For queries like this, which are fundamentally hard, the optimiser and
> statistics will only make so much difference. On this kind of query you are
> better off throwing more compute resources, i.e. memory, at it. Note that
> most memory usage for TDB is off-heap, so setting the heap size too high
> will actually reduce performance.
>
>  Also, if you’re comparing this query to other queries' performance, please
> bear in mind that most of the LUBM queries require inference to answer.
> Queries 4 through 13 cannot be answered without inference and so will give
> very fast but incorrect, i.e. empty, answers regardless of the scale of the
> dataset they are applied to.
>
>     This is the method:
>
>     public void testLUBMQuery1_original() {
>         long duration = 0L;
>         Date startTime, endTime;
>         startTime = new Date();
>
>         String sparqlQueryString =
>             "PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#> "
>             + "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> "
>             + "SELECT ?X ?Y ?Z WHERE { "
>             + "?Z ub:subOrganizationOf ?Y. "
>             + "?Y rdf:type ub:University. "
>             + "?Z rdf:type ub:Department. "
>             + "?X ub:memberOf ?Z. "
>             + "?X rdf:type ub:GraduateStudent. "
>             + "?X ub:undergraduateDegreeFrom ?Y. }";
>
>         Query query = QueryFactory.create(sparqlQueryString);
>
>         // Tried with and without the Algebra step
>         Op op = Algebra.compile(query);
>
>  This step happens automatically; there is no need to do it yourself.
>
>         query = OpAsQuery.asQuery(op);
>
>  This step is entirely unnecessary.
>
>         QueryExecution qexec = QueryExecutionFactory.create(query, dataset);
>         ResultSet results = qexec.execSelect();
>         ResultSetFormatter.out(results);
>         endTime = new Date();
>
> What are you actually trying to time? Since you write out the results,
> you are timing both the query execution and the result formatting. At
> larger scales the time to format can be huge.
>
> More specifically you are asking for text output which creates an ASCII
> table which requires multiple passes over the data in order to determine
> the column sizes.
>
> I would suggest that you simply iterate over the result set to consume it
> (you can use ResultSetFormatter.consume() for this) if you want to time
> just the query execution.
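
A sketch of that suggestion (the helper name is illustrative; assumes
apache-jena on the classpath and a Dataset that is already open):

```java
import org.apache.jena.query.Dataset;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.ResultSetFormatter;

public class QueryTimer {
    // Times execution only: consumes the result set without formatting it.
    static long timeQueryMillis(String sparql, Dataset dataset) {
        long start = System.nanoTime();
        try (QueryExecution qexec = QueryExecutionFactory.create(sparql, dataset)) {
            int count = ResultSetFormatter.consume(qexec.execSelect());
            long elapsed = (System.nanoTime() - start) / 1_000_000;
            System.out.println(count + " results in " + elapsed + " ms");
            return elapsed;
        }
    }
}
```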
>
> If you are interested, there are benchmarking tools that will give you
> much more accurate and reliable results:
>
> https://github.com/rvesse/sparql-query-bm
>
> Rob
>
>         duration = endTime.getTime() - startTime.getTime();
>
>         System.out.println(query.toString());
>         System.out.println("Original Query 1 Duration: " + duration);
>     }
>
>
>     Thanks Again.
>
>
>     Best,
>
>     Ganesh
>
>     On 10 January 2017 at 11:37, Andy Seaborne <[email protected]> wrote:
>
>     >
>     >
>     > On 09/01/17 19:40, A. Soroka wrote:
>     >
>     >> The layout of the statistics file is documented here:
>     >>
>     >> https://jena.apache.org/documentation/tdb/optimizer.html#statistics-rule-file
>     >>
>     >> tdbloader and tdbloader2 are the CLI utilities for building TDB
>     >> databases, but they are written in Java and can be used in Java.
>     >>
>     >> https://jena.apache.org/documentation/tdb/commands.html#tdbloader
>     >>
>     >>
>     > Jena is open source, and Maven Central has source artifacts that
>     > your IDE will automatically attach to your projects.
>     >
>     > See the package:
>     > org.apache.jena.tdb.solver.stats;
>     >
>     >         Andy
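
To make the Java route concrete, a minimal sketch (the paths and class name
are illustrative; assumes apache-jena, including the command classes from
jena-cmds, on the classpath):

```java
// tdb.tdbloader and tdb.tdbstats are the classes behind the CLI scripts
// and can be invoked directly from Java.
public class BuildTdb {
    public static void main(String[] args) throws Exception {
        // Load data and build the standard indexes at the given location.
        tdb.tdbloader.main("--loc=/path/to/DB", "/path/to/data.nt");

        // tdbstats prints the statistics rules to stdout; capture them
        // into DB/stats.opt for the optimizer to pick up.
        tdb.tdbstats.main("--loc=/path/to/DB");
    }
}
```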
>     >
>     >
>     > ---
>     >> A. Soroka
>     >> The University of Virginia Library
>     >>
>     >> On Jan 9, 2017, at 2:36 PM, Ganesh Selvaraj <
> [email protected]>
>     >>> wrote:
>     >>>
>     >>> Hi All,
>     >>>
>     >>> I am using Jena TDB for my work. So far I could not find much
>     >>> documentation
>     >>> on data indexing and statistics building for Jena TDB.
>     >>>
>     >>> I would prefer doing it via a Java API.
>     >>>
>     >>> Any help/documentation is appreciated.
>     >>>
>     >>> Thanks
>     >>> Ganesh
>     >>>
>     >>
>     >>