Re: Regex performance in relation to the BGP

Andy Seaborne Tue, 07 Jan 2014 01:45:21 -0800

On 06/01/14 12:35, Saud Aljaloud wrote:

Hi all,


Happy new year Jena guys, I am working on the performance of regex in relation 
to triple stores. Since Jena is a good choice to inspect, I tried to dig in the 
code and debug how sparql works out regex.

I'm using the latest source code 
https://svn.apache.org/repos/asf/jena/trunk/jena-arq/

I tried to debug the class com.hp.hpl.jena.sparql.expr.RegexJava and check how 
many visits to the method match() by printing a counter inside it,
I found out that the "order" of the query pattern within sparql can affect how 
many visits.

For example, testing this turtle "example.ttl":
<http://www.example.org/s1><http://www.example.org/p1> "triple1@en".
<http://www.example.org/s1><http://www.example.org/p1> "triple2@en".
<http://www.example.org/s1><http://www.example.org/p2> "triple3@en".
<http://www.example.org/s2><http://www.example.org/p2> "triple4@en".
<http://www.example.org/s2><http://www.example.org/p3> "triple5@en".
<http://www.example.org/s2><http://www.example.org/p3> "triple6@en".


With the usual model building:
Model model = ModelFactory.createDefaultModel();
model.read("/Users/user/example.ttl");
Query query = QueryFactory.create(queryString) ;
QueryExecution qexec = QueryExecutionFactory.create(query, model) ;
ResultSet results = qexec.execSelect() ;
ResultSetFormatter.out(System.out, results);



And the sparql query:
String queryString = "select * " +
                "where {" +
                "<http://www.example.org/s1> ?p ?o ." +
                "?s ?p ?o ." +
                "filter regex (?o, \"triple\")" +
                "}";

yields 3 visits, which is perfect,

However, the query:
String queryString = "select * " +
                "where {" +
                "?s ?p ?o ." +
                "<http://www.example.org/s1> ?p ?o ." +
                "filter regex (?o, \"triple\")" +
                "}";

Although both return the same result set, yet here, it prints 6 visits,

Even narrowing the query to something like the below prints 6 visits:
String queryString = "select * " +
                "where {" +
                "?s ?p ?o ." +
                "<http://www.example.org/s1> <http://www.example.org/p1> ?o ." +
                "<http://www.example.org/s1> ?p ?o ." +
                "filter regex (?o, \"triple\")" +
                "}";



Looks like the filter regex is only looking for the first query pattern 
regardless the rest of the BGP.
I'm not sure if this is the case for all other queries other than regex!


Ideally, I think, sparql should reduce the set of tested triples, regardless of 
the order of the query patterns, and since regex by its nature
slows down the performance, this will even add more overhead on the overall 
performance.

Is this a bug, or am I missing something here?

Cheers,
Saud.


Hi there,

When an optimizer looks at that query, it is trying to find the mostselective patterns to start with. When a FILTER is involved,selectivity estimation is hard and there are two alternatives of whetherto reorder the patterns, then place the filter, or place the filter thenreorder the patterns.

The pattern here has the feature that one triple pattern is a subset ofthe other. The ARQ optimizer does not look for that case - there'snothing stopping that being done, it just isn't currently.

In the case of in-memory, we also don't want the optimizer to spend toolong thinking. Queries can be short and specific, used at highfrequency. There is a real danger the optimizer will take longer thanexecuting the query if the optimizer tries to enumerate and weight allpossibilities. (Virtuoso wrote at length about this for the early BSBMruns and they changed their optimizer from that experience.) The sideand negative effect is that how the query is written can affect performance.

The Jena optimizer has two stages, and maybe this should be changed tosomething richer. There is a stage before the query starts of high-leveloptimization, which is rewriting the algebra expressions to "better"ones. Any given storage layer can choose the most appropriatetransformations to apply (including bypassing this stage). The storageexecution can also choose to make storage specific changes, referred toas low-level optimizations. Currently, that means reordering thetriples in a BGP because it needs the storage layer to keep the statistics.

If filter placement is done early, in high-level ioptimization, it canbreak up BGPs.


You optimized (high level) query is:

(sequence
  (filter (regex ?o "triple")
    (bgp (triple <http://www.example.org/s1> ?p ?o)))
  (bgp (triple ?s ?p ?o)))

[from arq.qparse --print=opt]

and there is no single BGP to reorder any more.  In the other order you get:

(sequence
  (filter (regex ?o "triple")
    (bgp (triple ?s ?p ?o)))
  (bgp (triple <http://www.example.org/s1> ?p ?o)))

Whether starting with the filter is a good idea depends on the filter.Determining the true selectivity of the filter is hard and the in-memorysystem does not keep value frequency statistics so it does not knowwhether a filter is highly selective of not.

What would be better is a more integrated approach of alegbra reorderingand storage-level BGP reordering. At least reordering by how ground thetriple patterns are would be good (usually!) before doing filterplacement.

Reordering by simply how grounded a triple pattern is, rather than usingdata-specific statistics, which are quite some work to keep, is oftenjust as good - slightly irritating to anyone whose written a statisticsbased reordering optimizer.


The in-memory order of the unbroken BGP is by groudedness:

(bgp
  (triple ?s ?p ?o)
  (triple <http://www.example.org/s1> ?p ?o)
)

and the second pattern will be first (its more grounded). The filterplacement got in the way of that.

TDB doesn't keep value-frequency stats either - it only uses triplepattern counts. You need value frequency to know whether


FILTER ( ?o < 56 ) is highly specific or not.  It looks a lot like
FILTER ( ?o >= 56 ) until you know the distribution of ?o.

TDB turns off the filter placement high-level optimization so that itsees the the whole BGP, and then it itself places the filter. Adownside of that is that filter placement across non-BGPs is lost - thecodebase for the next release has a much improved filter placementoptimization step but it does not yet have the controls to do someplacement but delay filter/BGP placement.

It has been reported that TDB, running in-memory, can be 50x faster thanthe in-memory store. I suspect this is due to filter placement effectsas it does better than in-memory in combining filters and BGPs.


You can try with..

Model model = TDBFactory.createDataset.getDefaultModel() ;
model.read ...

About regxps:

Regular expression matching can be costly but it's usually because theregular expression itself has choices. Your example doesn't. As it's afixed string pattern, it is compiled once and used repeatedly. Matchinga fixed string isn't zero-cost but I would not have through it was verylarge. If it had lots of meta-characters and disjunctions then theexecution can be costly.

Regex is done with java.util.regex - there is alternative which ispedantically more accurate based on some Xerces code (Java does not haveone of the flag equivalent of full XSD definition - "x" -which si to dowith whitespace mangling for readability).

(PS languages are "triple1"@en, not "triple1@en" unless you haverdf:PlainLiterals and technically they can't don't appear in RDF data).


SPARQL does have CONTAINS:

SELECT * {
  <http://www.example.org/s1> ?p ?o .
  ?s ?p ?o .
  FILTER CONTAINS(?o, "triple")
}

So there current optimization framework is a pragmatic balance. It canbe made better. You could try out different strategies within the Jenaframework.

You can install your own optimizer. The default one is just aparticular set of rewrites and the app can change that by installing itsown code. See Optimize.rewrite that implements Rewrite.rewrite. Theoptimization step can be set globally, per dataset or per query execution.


        Andy

Re: Regex performance in relation to the BGP

Reply via email to