Re: Regex performance in relation to the BGP

Saud Aljaloud Wed, 08 Jan 2014 04:31:41 -0800

Andy never disappoints me on these matters, whenever I browse the WEB about 
Jena and find a rich post, I immediately know it's Andy, This one is even more 
detailed.


And, yes, I tried it with in-memory tdb model, and it does select the least set 
of triples.

Cheers,
Saud.


On 7 Jan 2014, at 11:30, Milorad Tosic <[email protected]> wrote:

> Andy,
> 
> Thanks for the efforts to write this great explanation! IMHO, it is at least 
> as valuable as a published paper.
> 
> Regards,
> Milorad
> 
> 
> 
> 
>> ________________________________
>> From: Andy Seaborne <[email protected]>
>> To: [email protected] 
>> Sent: Tuesday, January 7, 2014 10:43 AM
>> Subject: Re: Regex performance in relation to the BGP
>> 
>> 
>> On 06/01/14 12:35, Saud Aljaloud wrote:
>>> Hi all,
>>> 
>>> Happy new year Jena guys, I am working on the performance of regex in 
>>> relation to triple stores. Since Jena is a good choice to inspect, I tried 
>>> to dig in the code and debug how sparql works out regex.
>>> 
>>> I'm using the latest source code 
>>> https://svn.apache.org/repos/asf/jena/trunk/jena-arq/
>>> 
>>> I tried to debug the class com.hp.hpl.jena.sparql.expr.RegexJava and check 
>>> how many visits to the method match() by printing a counter inside it,
>>> I found out that the "order" of the query pattern within sparql can affect 
>>> how many visits.
>>> 
>>> For example, testing this turtle "example.ttl":
>>> <http://www.example.org/s1><http://www.example.org/p1> "triple1@en".
>>> <http://www.example.org/s1><http://www.example.org/p1> "triple2@en".
>>> <http://www.example.org/s1><http://www.example.org/p2> "triple3@en".
>>> <http://www.example.org/s2><http://www.example.org/p2> "triple4@en".
>>> <http://www.example.org/s2><http://www.example.org/p3> "triple5@en".
>>> <http://www.example.org/s2><http://www.example.org/p3> "triple6@en".
>>> 
>>> 
>>> With the usual model building:
>>> Model model = ModelFactory.createDefaultModel();
>>> model.read("/Users/user/example.ttl");
>>> Query query = QueryFactory.create(queryString) ;
>>> QueryExecution qexec = QueryExecutionFactory.create(query, model) ;
>>> ResultSet results = qexec.execSelect() ;
>>> ResultSetFormatter.out(System.out, results);
>>> 
>>> 
>>> 
>>> And the sparql query:
>>> String queryString = "select * " +
>>>         "where {" +
>>>         "<http://www.example.org/s1> ?p ?o ." +
>>>         "?s ?p ?o ." +
>>>         "filter regex (?o, \"triple\")" +
>>>         "}";
>>> 
>>> yields 3 visits, which is perfect,
>>> 
>>> However, the query:
>>> String queryString = "select * " +
>>>         "where {" +
>>>         "?s ?p ?o ." +
>>>         "<http://www.example.org/s1> ?p ?o ." +
>>>         "filter regex (?o, \"triple\")" +
>>>         "}";
>>> 
>>> Although both return the same result set, yet here, it prints 6 visits,
>>> 
>>> Even narrowing the query to something like the below prints 6 visits:
>>> String queryString = "select * " +
>>>         "where {" +
>>>         "?s ?p ?o ." +
>>>         "<http://www.example.org/s1> <http://www.example.org/p1> ?o ." +
>>>         "<http://www.example.org/s1> ?p ?o ." +
>>>         "filter regex (?o, \"triple\")" +
>>>         "}";
>>> 
>>> 
>>> 
>>> Looks like the filter regex is only looking for the first query pattern 
>>> regardless the rest of the BGP.
>>> I'm not sure if this is the case for all other queries other than regex!
>>> 
>>> 
>>> Ideally, I think, sparql should reduce the set of tested triples, 
>>> regardless of the order of the query patterns, and since regex by its nature
>>> slows down the performance, this will even add more overhead on the overall 
>>> performance.
>>> 
>>> Is this a bug, or am I missing something here?
>>> 
>>> Cheers,
>>> Saud.
>> 
>> Hi there,
>> 
>> When an optimizer looks at that query, it is trying to find the most 
>> selective patterns to start with.  When a FILTER is involved, 
>> selectivity estimation is hard and there are two alternatives of whether 
>> to reorder the patterns, then place the filter, or place the filter then 
>> reorder the patterns.
>> 
>> The pattern here has the feature that one triple pattern is a subset of 
>> the other. The ARQ optimizer does not look for that case - there's 
>> nothing stopping that being done, it just isn't currently.
>> 
>> In the case of in-memory, we also don't want the optimizer to spend too 
>> long thinking.  Queries can be short and specific, used at high 
>> frequency. There is a real danger the optimizer will take longer than 
>> executing the query if the optimizer tries to enumerate and weight all 
>> possibilities.  (Virtuoso wrote at length about this for the early BSBM 
>> runs and they changed their optimizer from that experience.)  The side 
>> and negative effect is that how the query is written can affect performance.
>> 
>> The Jena optimizer has two stages, and maybe this should be changed to 
>> something richer. There is a stage before the query starts of high-level 
>> optimization, which is rewriting the algebra expressions to "better" 
>> ones.  Any given storage layer can choose the most appropriate 
>> transformations to apply (including bypassing this stage).  The storage 
>> execution can also choose to make storage specific changes, referred to 
>> as low-level optimizations.  Currently, that means reordering the 
>> triples in a BGP because it needs the storage layer to keep the statistics.
>> 
>> If filter placement is done early, in high-level ioptimization, it can 
>> break up BGPs.
>> 
>> You optimized (high level) query is:
>> 
>> (sequence
>>    (filter (regex ?o "triple")
>>      (bgp (triple <http://www.example.org/s1> ?p ?o)))
>>    (bgp (triple ?s ?p ?o)))
>> 
>> [from arq.qparse --print=opt]
>> 
>> and there is no single BGP to reorder any more.  In the other order you get:
>> 
>> (sequence
>>    (filter (regex ?o "triple")
>>      (bgp (triple ?s ?p ?o)))
>>    (bgp (triple <http://www.example.org/s1> ?p ?o)))
>> 
>> Whether starting with the filter is a good idea depends on the filter. 
>> Determining the true selectivity of the filter is hard and the in-memory 
>> system does not keep value frequency statistics so it does not know 
>> whether a filter is highly selective of not.
>> 
>> What would be better is a more integrated approach of alegbra reordering 
>> and storage-level BGP reordering.  At least reordering by how ground the 
>> triple patterns are would be good (usually!) before doing filter 
>> placement.
>> 
>> Reordering by simply how grounded a triple pattern is, rather than using 
>> data-specific statistics, which are quite some work to keep, is often 
>> just as good - slightly irritating to anyone whose written a statistics 
>> based reordering optimizer.
>> 
>> The in-memory order of the unbroken BGP is by groudedness:
>> 
>> (bgp
>>    (triple ?s ?p ?o)
>>    (triple <http://www.example.org/s1> ?p ?o)
>> )
>> 
>> and the second pattern will be first (its more grounded).  The filter 
>> placement got in the way of that.
>> 
>> TDB doesn't keep value-frequency stats either - it only uses triple 
>> pattern counts. You need value frequency to know whether
>> 
>> FILTER ( ?o < 56 ) is highly specific or not.  It looks a lot like
>> FILTER ( ?o >= 56 ) until you know the distribution of ?o.
>> 
>> TDB turns off the filter placement high-level optimization so that it 
>> sees the the whole BGP, and then it itself places the filter.  A 
>> downside of that is that filter placement across non-BGPs is lost - the 
>> codebase for the next release has a much improved filter placement 
>> optimization step but it does not yet have the controls to do some 
>> placement but delay filter/BGP placement.
>> 
>> It has been reported that TDB, running in-memory, can be 50x faster than 
>> the in-memory store. I suspect this is due to filter placement effects 
>> as it does better than in-memory in combining filters and BGPs.
>> 
>> You can try with..
>> 
>> Model model = TDBFactory.createDataset.getDefaultModel() ;
>> model.read ...
>> 
>> About regxps:
>> 
>> Regular expression matching can be costly but it's usually because the 
>> regular expression itself has choices.  Your example doesn't.  As it's a 
>> fixed string pattern, it is compiled once and used repeatedly.  Matching 
>> a fixed string isn't zero-cost but I would not have through it was very 
>> large.  If it had lots of meta-characters and disjunctions then the 
>> execution can be costly.
>> 
>> Regex is done with java.util.regex - there is alternative which is 
>> pedantically more accurate based on some Xerces code (Java does not have 
>> one of the flag equivalent of full XSD definition - "x" -which si to do 
>> with whitespace mangling for readability).
>> 
>> (PS languages are "triple1"@en, not "triple1@en" unless you have 
>> rdf:PlainLiterals and technically they can't don't appear in RDF data).
>> 
>> SPARQL does have CONTAINS:
>> 
>> SELECT * {
>> 
>>    <http://www.example.org/s1> ?p ?o .
>>    ?s ?p ?o .
>>    FILTER CONTAINS(?o, "triple")
>> }
>> 
>> So there current optimization framework is a pragmatic balance.  It can 
>> be made better.  You could try out different strategies within the Jena 
>> framework.
>> 
>> You can install your own optimizer.  The default one is just a 
>> particular set of rewrites and the app can change that by installing its 
>> own code.  See Optimize.rewrite that implements Rewrite.rewrite.  The 
>> optimization step can be set globally, per dataset or per query execution.
>> 
>>     Andy
>> 
>> 
>> 
>> 
>>

Re: Regex performance in relation to the BGP

Reply via email to