On Thu, 2016-01-07 at 08:48 +0000, Håvard Mikkelsen Ottestad wrote:
> Hi,
>
> Reordering the filters might help.

I've tried a bit of this, without much noticeable effect. I understand
that these are pretty well optimized when the SPARQL text is converted
to algebra.

> Also, maybe a stats file would reorder your query to be faster. I dunno how
> often (or if) fuseki generates a stats file. You can try to generate one by
> hand when fuseki is shut down:
> https://jena.apache.org/documentation/tdb/optimizer.html

There is a stats.opt file so I assume it's using that for optimization.

> Also I’m wondering what the performance is like if you take this line away:
> ?s :p1/:p2 ?nd.

When this occurs last in the BGP it does not have much effect. At the
top of the BGP it really slowed things down.

> One major performance drain I have seen in the past is filters on string
> literals. Especially if you are doing anything like CONTAINS or LOWERCASE. Do
> you have any of that?

I think all my filters are on numeric literals or URIs (see sample query
in other post). On other projects I have also noticed the impact of
string filters, and got much better results using the Lucene add-on for
that.

Regards,
--Paul

> Håvard
>
>
> On 07/01/16 03:51, "Paul Tyson" wrote:
>
> >On Wed, 2016-01-06 at 18:52 +0000, Andy Seaborne wrote:
> >> Hi Paul,
> >>
> >> > My question is: is total query time limited by search execution speed,
> >> > or by marshaling and serialization of search results?
> >>
> >> Costs are a bit of both but normally mainly query. It also depends on
> >> the client processing.
> >>
> >> Some context please:
> >> 1/ What's the storage layer?
> >TDB behind fuseki 2.3.1
> >
> >> 2/ What result set format are you getting?
> >text/csv
> >
> >> 3/ How are you handling the results on receipt in the client?
> >Just writing them to file for testing.
> >
> >>
> >> (Håvard's point about seeing data and query also applies)
> >Sorry, not easy to share the data.
> >
> >>
> >> The important point is that output is streamed.
> >>
> >> Results are sent while the query is executing; it is not the case that
> >> the query executes, all the results are calculated, and then results
> >> are produced.
> >>
> >> To investigate, modify the query to do something like this
> >>
> >> SELECT (count(*) AS ?C) { ... }
> >>
> >> because then the result set cost is low and all the query is executed
> >> before a result can be produced.
> >>
> >Yes, I did that, and the time is very nearly the same.
> >
> >So I conclude we are seeing the best performance possible unless there
> >is something terribly wrong with my queries. They are essentially of the
> >form:
> >
> >select ?s
> >where {
> >?nd :prop1 <uri1>;
> >  :prop2 "lit1";
> >  :prop3 ?var1;
> >  :prop4 ?var2;
> ># more properties of ?s
> >filter (?var1 > N1 && ?var1 < N2)
> >filter (?var2 in (<uriA>,<uriB>,...))
> >#more filters on ?nd properties
> >?s :p1/:p2 ?nd.
> >}
> >
> >Some of the filters get a little more complicated. And there is at least
> >one, possibly 2, UNION clauses. No OPTIONAL clauses. I've dissected the
> >queries and run each individual piece (triple + filter), and it seems to
> >be the more complicated filters that start to slow things down, as might
> >be expected.
> >
> >Thanks for your comments and interest. The performance we're seeing is
> >unacceptable for our application requirements, so I wanted to see if
> >there were any other performance factors I had missed.
> >
> >Regards,
> >--Paul
> >
> >> Andy
> >>
> >>
> >> On 06/01/16 16:17, Paul Tyson wrote:
> >> > I have a modest (17M triple) dataset, fairly flat graph. I run some
> >> > queries selecting nodes with anywhere from 12-20 different property
> >> > values.
> >> >
> >> > Result set counts are anywhere from 10,000 to 30,000 nodes. Total
> >> > execution times measured at the client are in the 30-40 second range.
> >> >
> >> > The web request begins streaming results immediately, but seems to take
> >> > longer than it should (based on the number of results and size of data
> >> > transfer). I also notice that the time is roughly linear with the size
> >> > of dataset--halving the dataset size halves the result set and the
> >> > execution time. I wouldn't have expected this behavior if all the time
> >> > was due to an indexed search.
> >> >
> >> > My question is: is total query time limited by search execution speed,
> >> > or by marshaling and serialization of search results?
> >> >
> >> > I have tried different query patterns, and believe I have the best
> >> > queries possible for the use case.
> >> >
> >> > I'm looking for other suggestions to reduce overall execution time. The
> >> > performance does not improve drastically going from 4Gb to 8 or 16Gb
> >> > RAM. My test platforms are 64-bit Windows, ranging from small server
> >> > (16Gb RAM, 4 CPU) to laptops with 4Gb RAM.
> >> >
> >> > Thanks,
> >> > --Paul
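
For reference, here is a rough sketch of the count rewrite suggested in the thread, applied to the query shape described above. The `:` namespace, the example.org URIs and the numeric bounds are placeholders (assumptions); the thread does not give the real values, and the real queries have more properties and filters.

```sparql
# Placeholder namespace (assumption, not from the thread).
PREFIX : <http://example.org/ns#>

# Same graph pattern as the "select ?s" form, but only one row comes back,
# so elapsed time is dominated by query execution rather than by
# serializing 10,000 to 30,000 result rows.
SELECT (count(*) AS ?C)
WHERE {
  ?nd :prop1 <http://example.org/uri1> ;
      :prop2 "lit1" ;
      :prop3 ?var1 ;
      :prop4 ?var2 .
  # Hypothetical bounds standing in for N1 and N2.
  FILTER (?var1 > 10 && ?var1 < 20)
  FILTER (?var2 IN (<http://example.org/uriA>, <http://example.org/uriB>))
  # Property path kept last in the BGP, where it was reported to cost little.
  ?s :p1/:p2 ?nd .
}
```

If this form takes about as long as the plain `select ?s` form, as reported above, then result serialization is unlikely to be the bottleneck and the time is going into pattern matching and filter evaluation.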
