On Wed, 2016-01-06 at 18:52 +0000, Andy Seaborne wrote:
> Hi Paul,
>
> > My question is: is total query time limited by search execution speed,
> > or by marshaling and serialization of search results?
>
> Costs are a bit of both but normally mainly query. It also depends on
> the client processing.
>
> Some context please:
> 1/ What's the storage layer?
TDB behind Fuseki 2.3.1.
> 2/ What result set format are you getting?
text/csv
> 3/ How are you handling the results on receipt in the client?
Just writing them to file for testing.
>
> (Håvard's point about seeing the data and query also applies.)
Sorry, not easy to share the data.
>
> The important point is that output is streamed.
>
> Results are sent while the query is executing; it is not the case that the
> query executes, all the results are calculated, and then results are produced.
>
> To investigate, modify the query to do something like this
>
> SELECT (count(*) AS ?C) { ... }
>
> because then the result set cost is low and all the query is executed
> before a result can be produced.
>
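For anyone repeating this comparison, here is a minimal sketch of how it might be timed from a client. It assumes a SPARQL endpoint that accepts form-encoded POST queries; the endpoint URL, prefix, and the helper names `count_query` and `time_query` are illustrative assumptions, not anything from my actual setup.

```python
import time
import urllib.parse
import urllib.request


def count_query(prologue: str, where_body: str) -> str:
    """Wrap a WHERE body so that only a single count row is returned."""
    return f"{prologue}\nSELECT (count(*) AS ?C) {{\n{where_body}\n}}"


def time_query(endpoint: str, query: str, accept: str = "text/csv"):
    """Return (seconds to first byte, total seconds, bytes received)."""
    # Passing data makes urllib send a form-encoded POST request.
    data = urllib.parse.urlencode({"query": query}).encode("utf-8")
    req = urllib.request.Request(endpoint, data=data, headers={"Accept": accept})
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        first = resp.read(1)  # blocks until the first byte of the body arrives
        t_first = time.perf_counter() - start
        size = len(first)
        while True:
            chunk = resp.read(65536)
            if not chunk:
                break
            size += len(chunk)
    return t_first, time.perf_counter() - start, size


# Hypothetical usage (endpoint and pattern are placeholders):
# endpoint = "http://localhost:3030/ds/query"
# prologue = "PREFIX : <http://example.org/>"
# body = "?s :p1/:p2 ?nd ."
# print(time_query(endpoint, f"{prologue}\nSELECT ?s {{\n{body}\n}}"))
# print(time_query(endpoint, count_query(prologue, body)))
```

If the total time for the count form is close to the total time for the full SELECT, the query execution itself dominates rather than result serialization.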
Yes, I did that, and the time is very nearly the same.
So I conclude we are seeing the best performance possible unless there
is something terribly wrong with my queries. They are essentially of the
form:
select ?s
where {
  ?nd :prop1 <uri1>;
      :prop2 "lit1";
      :prop3 ?var1;
      :prop4 ?var2;
      # more properties of ?s
  filter (?var1 > N1 && ?var1 < N2)
  filter (?var2 in (<uriA>,<uriB>,...))
  # more filters on ?nd properties
  ?s :p1/:p2 ?nd.
}
Some of the filters get a little more complicated. And there is at least
one, possibly 2, UNION clauses. No OPTIONAL clauses. I've dissected the
queries and run each individual piece (triple + filter), and it seems to
be the more complicated filters that start to slow things down, as might
be expected.
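
The dissection can be sketched roughly as below: build one count query per growing prefix of the pattern pieces, then time each one to see where the cost jumps. The function name, the prefix, and the example pieces are hypothetical stand-ins, not my real queries.

```python
def incremental_count_queries(prologue, pieces):
    """Yield (number of pieces, query) for growing prefixes of `pieces`."""
    for n in range(1, len(pieces) + 1):
        body = "\n".join(pieces[:n])
        yield n, f"{prologue}\nSELECT (count(*) AS ?C) {{\n{body}\n}}"


# Placeholder pieces, each a triple pattern plus the filter that uses it.
example_pieces = [
    "?nd :prop1 <http://example.org/uri1> .",
    "?nd :prop3 ?var1 . FILTER (?var1 > 10 && ?var1 < 20)",
    "?s :p1/:p2 ?nd .",
]

for n, query in incremental_count_queries("PREFIX : <http://example.org/>", example_pieces):
    print(f"--- first {n} piece(s) ---")
    print(query)
```

Each generated query can then be sent to the endpoint and timed; a large step between consecutive prefixes points at the piece that was just added.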
Thanks for your comments and interest. The performance we're seeing is
unacceptable for our application requirements, so I wanted to see if
there were any other performance factors I had missed.
Regards,
--Paul
> Andy
>
>
> On 06/01/16 16:17, Paul Tyson wrote:
> > I have a modest (17M triple) dataset, fairly flat graph. I run some
> > queries selecting nodes with anywhere from 12-20 different property
> > values.
> >
> > Result set counts are anywhere from 10,000 to 30,000 nodes. Total
> > execution times measured at the client are in the 30-40 second range.
> >
> > The web request begins streaming results immediately, but seems to take
> > longer than it should (based on the number of results and size of data
> > transfer). I also notice that the time is roughly linear with the size
> > of dataset--halving the dataset size halves the result set and the
> > execution time. I wouldn't have expected this behavior if all the time
> > was due to an indexed search.
> >
> > My question is: is total query time limited by search execution speed,
> > or by marshaling and serialization of search results?
> >
> > I have tried different query patterns, and believe I have the best
> > queries possible for the use case.
> >
> > I'm looking for other suggestions to reduce overall execution time. The
> > performance does not improve drastically going from 4GB to 8 or 16GB
> > RAM. My test platforms are 64-bit Windows, ranging from a small server
> > (16GB RAM, 4 CPUs) to laptops with 4GB RAM.
> >
> > Thanks,
> > --Paul
> >
>