Andy never disappoints me on these matters, whenever I browse the WEB about Jena and find a rich post, I immediately know it's Andy, This one is even more detailed.
And, yes, I tried it with in-memory tdb model, and it does select the least set of triples. Cheers, Saud. On 7 Jan 2014, at 11:30, Milorad Tosic <[email protected]> wrote: > Andy, > > Thanks for the efforts to write this great explanation! IMHO, it is at least > as valuable as a published paper. > > Regards, > Milorad > > > > >> ________________________________ >> From: Andy Seaborne <[email protected]> >> To: [email protected] >> Sent: Tuesday, January 7, 2014 10:43 AM >> Subject: Re: Regex performance in relation to the BGP >> >> >> On 06/01/14 12:35, Saud Aljaloud wrote: >>> Hi all, >>> >>> Happy new year Jena guys, I am working on the performance of regex in >>> relation to triple stores. Since Jena is a good choice to inspect, I tried >>> to dig in the code and debug how sparql works out regex. >>> >>> I'm using the latest source code >>> https://svn.apache.org/repos/asf/jena/trunk/jena-arq/ >>> >>> I tried to debug the class com.hp.hpl.jena.sparql.expr.RegexJava and check >>> how many visits to the method match() by printing a counter inside it, >>> I found out that the "order" of the query pattern within sparql can affect >>> how many visits. >>> >>> For example, testing this turtle "example.ttl": >>> <http://www.example.org/s1><http://www.example.org/p1> "triple1@en". >>> <http://www.example.org/s1><http://www.example.org/p1> "triple2@en". >>> <http://www.example.org/s1><http://www.example.org/p2> "triple3@en". >>> <http://www.example.org/s2><http://www.example.org/p2> "triple4@en". >>> <http://www.example.org/s2><http://www.example.org/p3> "triple5@en". >>> <http://www.example.org/s2><http://www.example.org/p3> "triple6@en". >>> >>> >>> With the usual model building: >>> Model model = ModelFactory.createDefaultModel(); >>> model.read("/Users/user/example.ttl"); >>> Query query = QueryFactory.create(queryString) ; >>> QueryExecution qexec = QueryExecutionFactory.create(query, model) ; >>> ResultSet results = qexec.execSelect() ; >>> ResultSetFormatter.out(System.out, results); >>> >>> >>> >>> And the sparql query: >>> String queryString = "select * " + >>> "where {" + >>> "<http://www.example.org/s1> ?p ?o ." + >>> "?s ?p ?o ." + >>> "filter regex (?o, \"triple\")" + >>> "}"; >>> >>> yields 3 visits, which is perfect, >>> >>> However, the query: >>> String queryString = "select * " + >>> "where {" + >>> "?s ?p ?o ." + >>> "<http://www.example.org/s1> ?p ?o ." + >>> "filter regex (?o, \"triple\")" + >>> "}"; >>> >>> Although both return the same result set, yet here, it prints 6 visits, >>> >>> Even narrowing the query to something like the below prints 6 visits: >>> String queryString = "select * " + >>> "where {" + >>> "?s ?p ?o ." + >>> "<http://www.example.org/s1> <http://www.example.org/p1> ?o ." + >>> "<http://www.example.org/s1> ?p ?o ." + >>> "filter regex (?o, \"triple\")" + >>> "}"; >>> >>> >>> >>> Looks like the filter regex is only looking for the first query pattern >>> regardless the rest of the BGP. >>> I'm not sure if this is the case for all other queries other than regex! >>> >>> >>> Ideally, I think, sparql should reduce the set of tested triples, >>> regardless of the order of the query patterns, and since regex by its nature >>> slows down the performance, this will even add more overhead on the overall >>> performance. >>> >>> Is this a bug, or am I missing something here? >>> >>> Cheers, >>> Saud. >> >> Hi there, >> >> When an optimizer looks at that query, it is trying to find the most >> selective patterns to start with. When a FILTER is involved, >> selectivity estimation is hard and there are two alternatives of whether >> to reorder the patterns, then place the filter, or place the filter then >> reorder the patterns. >> >> The pattern here has the feature that one triple pattern is a subset of >> the other. The ARQ optimizer does not look for that case - there's >> nothing stopping that being done, it just isn't currently. >> >> In the case of in-memory, we also don't want the optimizer to spend too >> long thinking. Queries can be short and specific, used at high >> frequency. There is a real danger the optimizer will take longer than >> executing the query if the optimizer tries to enumerate and weight all >> possibilities. (Virtuoso wrote at length about this for the early BSBM >> runs and they changed their optimizer from that experience.) The side >> and negative effect is that how the query is written can affect performance. >> >> The Jena optimizer has two stages, and maybe this should be changed to >> something richer. There is a stage before the query starts of high-level >> optimization, which is rewriting the algebra expressions to "better" >> ones. Any given storage layer can choose the most appropriate >> transformations to apply (including bypassing this stage). The storage >> execution can also choose to make storage specific changes, referred to >> as low-level optimizations. Currently, that means reordering the >> triples in a BGP because it needs the storage layer to keep the statistics. >> >> If filter placement is done early, in high-level ioptimization, it can >> break up BGPs. >> >> You optimized (high level) query is: >> >> (sequence >> (filter (regex ?o "triple") >> (bgp (triple <http://www.example.org/s1> ?p ?o))) >> (bgp (triple ?s ?p ?o))) >> >> [from arq.qparse --print=opt] >> >> and there is no single BGP to reorder any more. In the other order you get: >> >> (sequence >> (filter (regex ?o "triple") >> (bgp (triple ?s ?p ?o))) >> (bgp (triple <http://www.example.org/s1> ?p ?o))) >> >> Whether starting with the filter is a good idea depends on the filter. >> Determining the true selectivity of the filter is hard and the in-memory >> system does not keep value frequency statistics so it does not know >> whether a filter is highly selective of not. >> >> What would be better is a more integrated approach of alegbra reordering >> and storage-level BGP reordering. At least reordering by how ground the >> triple patterns are would be good (usually!) before doing filter >> placement. >> >> Reordering by simply how grounded a triple pattern is, rather than using >> data-specific statistics, which are quite some work to keep, is often >> just as good - slightly irritating to anyone whose written a statistics >> based reordering optimizer. >> >> The in-memory order of the unbroken BGP is by groudedness: >> >> (bgp >> (triple ?s ?p ?o) >> (triple <http://www.example.org/s1> ?p ?o) >> ) >> >> and the second pattern will be first (its more grounded). The filter >> placement got in the way of that. >> >> TDB doesn't keep value-frequency stats either - it only uses triple >> pattern counts. You need value frequency to know whether >> >> FILTER ( ?o < 56 ) is highly specific or not. It looks a lot like >> FILTER ( ?o >= 56 ) until you know the distribution of ?o. >> >> TDB turns off the filter placement high-level optimization so that it >> sees the the whole BGP, and then it itself places the filter. A >> downside of that is that filter placement across non-BGPs is lost - the >> codebase for the next release has a much improved filter placement >> optimization step but it does not yet have the controls to do some >> placement but delay filter/BGP placement. >> >> It has been reported that TDB, running in-memory, can be 50x faster than >> the in-memory store. I suspect this is due to filter placement effects >> as it does better than in-memory in combining filters and BGPs. >> >> You can try with.. >> >> Model model = TDBFactory.createDataset.getDefaultModel() ; >> model.read ... >> >> About regxps: >> >> Regular expression matching can be costly but it's usually because the >> regular expression itself has choices. Your example doesn't. As it's a >> fixed string pattern, it is compiled once and used repeatedly. Matching >> a fixed string isn't zero-cost but I would not have through it was very >> large. If it had lots of meta-characters and disjunctions then the >> execution can be costly. >> >> Regex is done with java.util.regex - there is alternative which is >> pedantically more accurate based on some Xerces code (Java does not have >> one of the flag equivalent of full XSD definition - "x" -which si to do >> with whitespace mangling for readability). >> >> (PS languages are "triple1"@en, not "triple1@en" unless you have >> rdf:PlainLiterals and technically they can't don't appear in RDF data). >> >> SPARQL does have CONTAINS: >> >> SELECT * { >> >> <http://www.example.org/s1> ?p ?o . >> ?s ?p ?o . >> FILTER CONTAINS(?o, "triple") >> } >> >> So there current optimization framework is a pragmatic balance. It can >> be made better. You could try out different strategies within the Jena >> framework. >> >> You can install your own optimizer. The default one is just a >> particular set of rewrites and the app can change that by installing its >> own code. See Optimize.rewrite that implements Rewrite.rewrite. The >> optimization step can be set globally, per dataset or per query execution. >> >> Andy >> >> >> >> >>
