On 06/01/14 12:35, Saud Aljaloud wrote:
Hi all,

Happy new year Jena guys, I am working on the performance of regex in relation 
to triple stores. Since Jena is a good choice to inspect, I tried to dig in the 
code and debug how sparql works out regex.

I'm using the latest source code 
https://svn.apache.org/repos/asf/jena/trunk/jena-arq/

I tried to debug the class com.hp.hpl.jena.sparql.expr.RegexJava and check how 
many visits to the method match() by printing a counter inside it,
I found out that the "order" of the query pattern within sparql can affect how 
many visits.

For example, testing this turtle "example.ttl":
<http://www.example.org/s1><http://www.example.org/p1> "triple1@en".
<http://www.example.org/s1><http://www.example.org/p1> "triple2@en".
<http://www.example.org/s1><http://www.example.org/p2> "triple3@en".
<http://www.example.org/s2><http://www.example.org/p2> "triple4@en".
<http://www.example.org/s2><http://www.example.org/p3> "triple5@en".
<http://www.example.org/s2><http://www.example.org/p3> "triple6@en".


With the usual model building:
Model model = ModelFactory.createDefaultModel();
model.read("/Users/user/example.ttl");
Query query = QueryFactory.create(queryString) ;
QueryExecution qexec = QueryExecutionFactory.create(query, model) ;
ResultSet results = qexec.execSelect() ;
ResultSetFormatter.out(System.out, results);



And the sparql query:
String queryString = "select * " +
                "where {" +
                "<http://www.example.org/s1> ?p ?o ." +
                "?s ?p ?o ." +
                "filter regex (?o, \"triple\")" +
                "}";

yields 3 visits, which is perfect,

However, the query:
String queryString = "select * " +
                "where {" +
                "?s ?p ?o ." +
                "<http://www.example.org/s1> ?p ?o ." +
                "filter regex (?o, \"triple\")" +
                "}";

Although both return the same result set, yet here, it prints 6 visits,

Even narrowing the query to something like the below prints 6 visits:
String queryString = "select * " +
                "where {" +
                "?s ?p ?o ." +
                "<http://www.example.org/s1> <http://www.example.org/p1> ?o ." +
                "<http://www.example.org/s1> ?p ?o ." +
                "filter regex (?o, \"triple\")" +
                "}";



Looks like the filter regex is only looking for the first query pattern 
regardless the rest of the BGP.
I'm not sure if this is the case for all other queries other than regex!


Ideally, I think, sparql should reduce the set of tested triples, regardless of 
the order of the query patterns, and since regex by its nature
slows down the performance, this will even add more overhead on the overall 
performance.

Is this a bug, or am I missing something here?

Cheers,
Saud.

Hi there,

When an optimizer looks at that query, it is trying to find the most selective patterns to start with. When a FILTER is involved, selectivity estimation is hard and there are two alternatives of whether to reorder the patterns, then place the filter, or place the filter then reorder the patterns.

The pattern here has the feature that one triple pattern is a subset of the other. The ARQ optimizer does not look for that case - there's nothing stopping that being done, it just isn't currently.

In the case of in-memory, we also don't want the optimizer to spend too long thinking. Queries can be short and specific, used at high frequency. There is a real danger the optimizer will take longer than executing the query if the optimizer tries to enumerate and weight all possibilities. (Virtuoso wrote at length about this for the early BSBM runs and they changed their optimizer from that experience.) The side and negative effect is that how the query is written can affect performance.

The Jena optimizer has two stages, and maybe this should be changed to something richer. There is a stage before the query starts of high-level optimization, which is rewriting the algebra expressions to "better" ones. Any given storage layer can choose the most appropriate transformations to apply (including bypassing this stage). The storage execution can also choose to make storage specific changes, referred to as low-level optimizations. Currently, that means reordering the triples in a BGP because it needs the storage layer to keep the statistics.

If filter placement is done early, in high-level ioptimization, it can break up BGPs.

You optimized (high level) query is:

(sequence
  (filter (regex ?o "triple")
    (bgp (triple <http://www.example.org/s1> ?p ?o)))
  (bgp (triple ?s ?p ?o)))

[from arq.qparse --print=opt]

and there is no single BGP to reorder any more.  In the other order you get:

(sequence
  (filter (regex ?o "triple")
    (bgp (triple ?s ?p ?o)))
  (bgp (triple <http://www.example.org/s1> ?p ?o)))

Whether starting with the filter is a good idea depends on the filter. Determining the true selectivity of the filter is hard and the in-memory system does not keep value frequency statistics so it does not know whether a filter is highly selective of not.

What would be better is a more integrated approach of alegbra reordering and storage-level BGP reordering. At least reordering by how ground the triple patterns are would be good (usually!) before doing filter placement.

Reordering by simply how grounded a triple pattern is, rather than using data-specific statistics, which are quite some work to keep, is often just as good - slightly irritating to anyone whose written a statistics based reordering optimizer.

The in-memory order of the unbroken BGP is by groudedness:

(bgp
  (triple ?s ?p ?o)
  (triple <http://www.example.org/s1> ?p ?o)
)

and the second pattern will be first (its more grounded). The filter placement got in the way of that.

TDB doesn't keep value-frequency stats either - it only uses triple pattern counts. You need value frequency to know whether

FILTER ( ?o < 56 ) is highly specific or not.  It looks a lot like
FILTER ( ?o >= 56 ) until you know the distribution of ?o.

TDB turns off the filter placement high-level optimization so that it sees the the whole BGP, and then it itself places the filter. A downside of that is that filter placement across non-BGPs is lost - the codebase for the next release has a much improved filter placement optimization step but it does not yet have the controls to do some placement but delay filter/BGP placement.

It has been reported that TDB, running in-memory, can be 50x faster than the in-memory store. I suspect this is due to filter placement effects as it does better than in-memory in combining filters and BGPs.

You can try with..

Model model = TDBFactory.createDataset.getDefaultModel() ;
model.read ...

About regxps:

Regular expression matching can be costly but it's usually because the regular expression itself has choices. Your example doesn't. As it's a fixed string pattern, it is compiled once and used repeatedly. Matching a fixed string isn't zero-cost but I would not have through it was very large. If it had lots of meta-characters and disjunctions then the execution can be costly.

Regex is done with java.util.regex - there is alternative which is pedantically more accurate based on some Xerces code (Java does not have one of the flag equivalent of full XSD definition - "x" -which si to do with whitespace mangling for readability).

(PS languages are "triple1"@en, not "triple1@en" unless you have rdf:PlainLiterals and technically they can't don't appear in RDF data).

SPARQL does have CONTAINS:

SELECT * {
  <http://www.example.org/s1> ?p ?o .
  ?s ?p ?o .
  FILTER CONTAINS(?o, "triple")
}

So there current optimization framework is a pragmatic balance. It can be made better. You could try out different strategies within the Jena framework.

You can install your own optimizer. The default one is just a particular set of rewrites and the app can change that by installing its own code. See Optimize.rewrite that implements Rewrite.rewrite. The optimization step can be set globally, per dataset or per query execution.

        Andy


Reply via email to