Re: optimizing serialization of results from fuseki

Paul Tyson Thu, 07 Jan 2016 08:25:34 -0800

On Thu, 2016-01-07 at 08:48 +0000, Håvard Mikkelsen Ottestad wrote:
> Hi,
> 
> Reordering the filters might help.
I've tried a bit of this, without much noticeable effect. I understand
that these are pretty well optimized when the sparql text is converted
to algebra.


> 
> Also, maybe a stats file would reorder your query to be faster. I dunno how 
> often (or if) fuseki generates a stats file. You can try to generate one by 
> hand when fuseki is shut down: 
> https://jena.apache.org/documentation/tdb/optimizer.html
> 
There is a stats.opt file so I assume it's using that for optimization.

> Also I’m wondering what the performance is like if you take this line away: 
> ?s :p1/:p2 ?nd.
> 
When this occurs last in the BGP it does not have much effect. At the
top of the BGP it really slowed things down.

> 
> One major performance drain I have seen in the past is filters on string 
> literals. Especially if you are doing anything like CONTAINS or LOWERCASE. Do 
> you have any of that?
> 
I think all my filters are on numeric literals or URIs (see sample query
in other post). On other projects I also have noticed the impact of
string filters, and got much better results using Lucene add-on for
that.

Regards,
--Paul

> Håvard
> 
> 
> 
> 
> On 07/01/16 03:51, "Paul Tyson" <[email protected]> wrote:
> 
> >On Wed, 2016-01-06 at 18:52 +0000, Andy Seaborne wrote:
> >> Hi Paul,
> >> 
> >>  > My question is: is total query time limited by search execution speed,
> >>  > or by marshaling and serialization of search results?
> >> 
> >> Costs are a bit of both but normally mainly query.  It also depends on 
> >> the client processing.
> >> 
> >> Some context please:
> >> 1/ What's the storage layer?
> >TDB behind fuseki 2.3.1
> >
> >> 2/ What result set format are you getting?
> >text/csv
> >
> >> 3/ How are you handling the results on receipt in the client?
> >Just writing them to file for testing.
> >
> >> 
> >> (Håvard's point about seeing data and query also applies)
> >Sorry, not easy to share the data.
> >
> >> 
> >> The important point is that output is streamed.
> >> 
> >> Results are sent while the query is executing; it is not the case that
> >> the query executes, all the results are calculated, and only then are
> >> results produced.
> >> 
> >> To investigate, modify the query to do something like this
> >> 
> >> SELECT (count(*) AS ?C) { ... }
> >> 
> >> because then the result set cost is low and all the query is executed 
> >> before a result can be produced.
> >> 
> >Yes, I did that, and the time is very nearly the same.
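A minimal sketch of that comparison, assuming a Fuseki SPARQL endpoint that accepts form-encoded POST queries. The endpoint URL, the dataset name "ds" and the helper names are hypothetical and not taken from this thread. It wraps a WHERE body in count(*) and times a request both to the first byte and to completion, so the streamed text/csv run can be set against the count-only run.

```python
import time
import urllib.parse
import urllib.request


def as_count_query(where_body, prologue=""):
    """Wrap a WHERE-clause body in SELECT (count(*) AS ?C) WHERE { ... }.

    `prologue` holds any PREFIX lines and should end with a newline.
    """
    return f"{prologue}SELECT (count(*) AS ?C) WHERE {{\n{where_body}\n}}"


def build_request(endpoint, query, accept="text/csv"):
    """Build a SPARQL protocol POST request with a form-encoded query."""
    body = urllib.parse.urlencode({"query": query}).encode("ascii")
    return urllib.request.Request(endpoint, data=body, headers={"Accept": accept})


def time_query(endpoint, query, accept="text/csv"):
    """Return (seconds to first byte, total seconds, bytes received)."""
    request = build_request(endpoint, query, accept)
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        first = response.read(1)  # blocks until the first body byte arrives
        first_byte = time.perf_counter() - start
        size = len(first)
        while chunk := response.read(65536):
            size += len(chunk)
    total = time.perf_counter() - start
    return first_byte, total, size


def compare(endpoint, where_body, prologue=""):
    """Time the full SELECT ?s run against the count(*) run of the same body."""
    select_query = f"{prologue}SELECT ?s WHERE {{\n{where_body}\n}}"
    return {
        "select": time_query(endpoint, select_query),
        "count": time_query(endpoint, as_count_query(where_body, prologue)),
    }
```

It could be called as compare("http://localhost:3030/ds/query", body, prologue), where that URL is only an assumed local setup. If both totals come out about the same, as reported here, most of the time is going into query execution rather than into serializing results.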
> >
> >So I conclude we are seeing the best performance possible unless there
> >is something terribly wrong with my queries. They are essentially of the
> >form:
> >
> >select ?s
> >where {
> >?nd :prop1 <uri1>;
> >  :prop2 "lit1";
> >  :prop3 ?var1;
> >  :prop4 ?var2;
> ># more properties of ?nd
> >filter (?var1 > N1 && ?var1 < N2)
> >filter (?var2 in (<uriA>,<uriB>,...))
> >#more filters on ?nd properties
> >?s :p1/:p2 ?nd.
> >}
> >
> >Some of the filters get a little more complicated. And there is at least
> >one, possibly 2, UNION clauses. No OPTIONAL clauses. I've dissected the
> >queries and run each individual piece (triple + filter), and it seems to
> >be the more complicated filters that start to slow things down, as might
> >be expected.
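One way to repeat that piece-by-piece check is sketched below, assuming each piece is a triple pattern plus its filter and that `run` is whatever function sends a query to the server. The piece labels, the numeric bounds and the helper names are hypothetical; the property and URI names are the placeholders from the query form above.

```python
import time


def probe_queries(pieces, prologue=""):
    """Return (label, query) pairs, one count(*) query per piece."""
    return [
        (label, f"{prologue}SELECT (count(*) AS ?C) WHERE {{\n{pattern}\n}}")
        for label, pattern in pieces
    ]


def rank_pieces(pieces, run, prologue=""):
    """Run each probe through `run` and return (seconds, label), slowest first."""
    timings = []
    for label, query in probe_queries(pieces, prologue):
        start = time.perf_counter()
        run(query)  # `run` is supplied by the caller, e.g. an HTTP client call
        timings.append((time.perf_counter() - start, label))
    return sorted(timings, reverse=True)


# Illustrative pieces only: 10 and 20 stand in for N1 and N2.
EXAMPLE_PIECES = [
    ("range filter", "?nd :prop3 ?var1 .\nFILTER (?var1 > 10 && ?var1 < 20)"),
    ("IN filter", "?nd :prop4 ?var2 .\nFILTER (?var2 IN (<uriA>, <uriB>))"),
    ("path", "?s :p1/:p2 ?nd ."),
]
```

Timing pieces in isolation can overstate their cost, since a pattern that is cheap once ?nd is already bound may be expensive when it runs on its own.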
> >
> >Thanks for your comments and interest. The performance we're seeing is
> >unacceptable for our application requirements, so I wanted to see if
> >there were any other performance factors I had missed.
> >
> >Regards,
> >--Paul
> >
> >>     Andy
> >> 
> >> 
> >> On 06/01/16 16:17, Paul Tyson wrote:
> >> > I have a modest (17M triple) dataset, fairly flat graph. I run some
> >> > queries selecting nodes with anywhere from 12-20 different property
> >> > values.
> >> >
> >> > Result set counts are anywhere from 10,000 to 30,000 nodes. Total
> >> > execution times measured at the client are in the 30-40 second range.
> >> >
> >> > The web request begins streaming results immediately, but seems to take
> >> > longer than it should (based on the number of results and size of data
> >> > transfer). I also notice that the time is roughly linear with the size
> >> > of dataset--halving the dataset size halves the result set and the
> >> > execution time. I wouldn't have expected this behavior if all the time
> >> > was due to an indexed search.
> >> >
> >> > My question is: is total query time limited by search execution speed,
> >> > or by marshaling and serialization of search results?
> >> >
> >> > I have tried different query patterns, and believe I have the best
> >> > queries possible for the use case.
> >> >
> >> > I'm looking for other suggestions to reduce overall execution time. The
> >> > performance does not improve drastically going from 4GB to 8 or 16GB
> >> > RAM. My test platforms are 64-bit Windows, ranging from a small server
> >> > (16GB RAM, 4 CPUs) to laptops with 4GB RAM.
> >> >
> >> > Thanks,
> >> > --Paul
> >> >
> >> 
> >
> >
