Re: Regex performance in relation to the BGP

Andy Seaborne Tue, 07 Jan 2014 02:19:15 -0800

The code in SVN is in better shape than I remember :-)

The reworked filter placement optimizer (JENA-595) has controls forwhether filters get pushed into BGPs or not. TDB utilizes this sofilter are placed at the high level, BGPs reorder then filters placedwithin BGPs.


Jena used to (2.11.0 and before) not properly place a filter like:

SELECT * {
  ?s ?p ?o .
  FILTER(?o = 89)
  OPTIONAL { ?s ?q ?w }
}

It was:

(filter (= ?o 89)
  (conditional
    (bgp (triple ?s ?p ?o))
    (bgp (triple ?s ?q ?w))))

it is now in SVN:

(conditional
  (filter (= ?o 89)
    (bgp (triple ?s ?p ?o)))
  (bgp (triple ?s ?q ?w)))

JENA-619 is a follow-on to JENA-595 - this needs to be done forin-memory as well.


        Andy

[1] https://issues.apache.org/jira/browse/JENA-619

On 07/01/14 09:43, Andy Seaborne wrote:

On 06/01/14 12:35, Saud Aljaloud wrote:

Hi all,

Happy new year Jena guys, I am working on the performance of regex in
relation to triple stores. Since Jena is a good choice to inspect, I
tried to dig in the code and debug how sparql works out regex.

I'm using the latest source code
https://svn.apache.org/repos/asf/jena/trunk/jena-arq/

I tried to debug the class com.hp.hpl.jena.sparql.expr.RegexJava and
check how many visits to the method match() by printing a counter
inside it,
I found out that the "order" of the query pattern within sparql can
affect how many visits.

For example, testing this turtle "example.ttl":
<http://www.example.org/s1><http://www.example.org/p1> "triple1@en".
<http://www.example.org/s1><http://www.example.org/p1> "triple2@en".
<http://www.example.org/s1><http://www.example.org/p2> "triple3@en".
<http://www.example.org/s2><http://www.example.org/p2> "triple4@en".
<http://www.example.org/s2><http://www.example.org/p3> "triple5@en".
<http://www.example.org/s2><http://www.example.org/p3> "triple6@en".


With the usual model building:
Model model = ModelFactory.createDefaultModel();
model.read("/Users/user/example.ttl");
Query query = QueryFactory.create(queryString) ;
QueryExecution qexec = QueryExecutionFactory.create(query, model) ;
ResultSet results = qexec.execSelect() ;
ResultSetFormatter.out(System.out, results);



And the sparql query:
String queryString = "select * " +
        "where {" +
        "<http://www.example.org/s1> ?p ?o ." +
        "?s ?p ?o ." +
        "filter regex (?o, \"triple\")" +
        "}";

yields 3 visits, which is perfect,

However, the query:
String queryString = "select * " +
        "where {" +
        "?s ?p ?o ." +
        "<http://www.example.org/s1> ?p ?o ." +
        "filter regex (?o, \"triple\")" +
        "}";

Although both return the same result set, yet here, it prints 6 visits,

Even narrowing the query to something like the below prints 6 visits:
String queryString = "select * " +
        "where {" +
        "?s ?p ?o ." +
        "<http://www.example.org/s1> <http://www.example.org/p1> ?o ." +
        "<http://www.example.org/s1> ?p ?o ." +
        "filter regex (?o, \"triple\")" +
        "}";



Looks like the filter regex is only looking for the first query
pattern regardless the rest of the BGP.
I'm not sure if this is the case for all other queries other than regex!


Ideally, I think, sparql should reduce the set of tested triples,
regardless of the order of the query patterns, and since regex by its
nature
slows down the performance, this will even add more overhead on the
overall performance.

Is this a bug, or am I missing something here?

Cheers,
Saud.


Hi there,

When an optimizer looks at that query, it is trying to find the most
selective patterns to start with.  When a FILTER is involved,
selectivity estimation is hard and there are two alternatives of whether
to reorder the patterns, then place the filter, or place the filter then
reorder the patterns.

The pattern here has the feature that one triple pattern is a subset of
the other. The ARQ optimizer does not look for that case - there's
nothing stopping that being done, it just isn't currently.

In the case of in-memory, we also don't want the optimizer to spend too
long thinking.  Queries can be short and specific, used at high
frequency. There is a real danger the optimizer will take longer than
executing the query if the optimizer tries to enumerate and weight all
possibilities.  (Virtuoso wrote at length about this for the early BSBM
runs and they changed their optimizer from that experience.)  The side
and negative effect is that how the query is written can affect
performance.

The Jena optimizer has two stages, and maybe this should be changed to
something richer. There is a stage before the query starts of high-level
optimization, which is rewriting the algebra expressions to "better"
ones.  Any given storage layer can choose the most appropriate
transformations to apply (including bypassing this stage).  The storage
execution can also choose to make storage specific changes, referred to
as low-level optimizations.  Currently, that means reordering the
triples in a BGP because it needs the storage layer to keep the statistics.

If filter placement is done early, in high-level ioptimization, it can
break up BGPs.

You optimized (high level) query is:

(sequence
   (filter (regex ?o "triple")
     (bgp (triple <http://www.example.org/s1> ?p ?o)))
   (bgp (triple ?s ?p ?o)))

[from arq.qparse --print=opt]

and there is no single BGP to reorder any more.  In the other order you
get:

(sequence
   (filter (regex ?o "triple")
     (bgp (triple ?s ?p ?o)))
   (bgp (triple <http://www.example.org/s1> ?p ?o)))

Whether starting with the filter is a good idea depends on the filter.
Determining the true selectivity of the filter is hard and the in-memory
system does not keep value frequency statistics so it does not know
whether a filter is highly selective of not.

What would be better is a more integrated approach of alegbra reordering
and storage-level BGP reordering.  At least reordering by how ground the
triple patterns are would be good (usually!) before doing filter placement.

Reordering by simply how grounded a triple pattern is, rather than using
data-specific statistics, which are quite some work to keep, is often
just as good - slightly irritating to anyone whose written a statistics
based reordering optimizer.

The in-memory order of the unbroken BGP is by groudedness:

(bgp
   (triple ?s ?p ?o)
   (triple <http://www.example.org/s1> ?p ?o)
)

and the second pattern will be first (its more grounded).  The filter
placement got in the way of that.

TDB doesn't keep value-frequency stats either - it only uses triple
pattern counts. You need value frequency to know whether

FILTER ( ?o < 56 ) is highly specific or not.  It looks a lot like
FILTER ( ?o >= 56 ) until you know the distribution of ?o.

TDB turns off the filter placement high-level optimization so that it
sees the the whole BGP, and then it itself places the filter.  A
downside of that is that filter placement across non-BGPs is lost - the
codebase for the next release has a much improved filter placement
optimization step but it does not yet have the controls to do some
placement but delay filter/BGP placement.

It has been reported that TDB, running in-memory, can be 50x faster than
the in-memory store. I suspect this is due to filter placement effects
as it does better than in-memory in combining filters and BGPs.

You can try with..

Model model = TDBFactory.createDataset.getDefaultModel() ;
model.read ...

About regxps:

Regular expression matching can be costly but it's usually because the
regular expression itself has choices.  Your example doesn't.  As it's a
fixed string pattern, it is compiled once and used repeatedly.  Matching
a fixed string isn't zero-cost but I would not have through it was very
large.  If it had lots of meta-characters and disjunctions then the
execution can be costly.

Regex is done with java.util.regex - there is alternative which is
pedantically more accurate based on some Xerces code (Java does not have
one of the flag equivalent of full XSD definition - "x" -which si to do
with whitespace mangling for readability).

(PS languages are "triple1"@en, not "triple1@en" unless you have
rdf:PlainLiterals and technically they can't don't appear in RDF data).

SPARQL does have CONTAINS:

SELECT * {
   <http://www.example.org/s1> ?p ?o .
   ?s ?p ?o .
   FILTER CONTAINS(?o, "triple")
}

So there current optimization framework is a pragmatic balance.  It can
be made better.  You could try out different strategies within the Jena
framework.

You can install your own optimizer.  The default one is just a
particular set of rewrites and the app can change that by installing its
own code.  See Optimize.rewrite that implements Rewrite.rewrite.  The
optimization step can be set globally, per dataset or per query execution.

     Andy

Re: Regex performance in relation to the BGP

Reply via email to