Re: optimizing serialization of results from fuseki

Paul Tyson Thu, 07 Jan 2016 07:57:44 -0800

Here is an actual query, partially obfuscated. It returns about 18K
nodes in 40 seconds, from a dataset of about 17M triples. (The nodes are
not necessarily distinct.)


The predominant graph structure is like:

?node <- ?lsu -> ?detail -> LSUPROPERTYVALUE

Thanks for your attention and any suggestions for improvement.

prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
prefix lsu: <http://rules.example.org/ns/lsu#>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT (count(?node) as ?cnt)
WHERE {
?detail lsu:source "XYZ".
?detail lsu:length-type "Ltype".
?detail lsu:max-length-exclusive ?maxe_len;
  lsu:max-length-inclusive ?maxi_len;
  lsu:min-length-inclusive ?mine_len;
  lsu:min-length-exclusive ?mini_len.
FILTER (
  (?maxe_len = rdf:nil || ?maxe_len > "95"^^xsd:decimal)
  && (?maxi_len = rdf:nil || ?maxi_len >= "95"^^xsd:decimal)
  && (?mine_len = rdf:nil || ?mine_len < "95"^^xsd:decimal)
  && (?mini_len = rdf:nil || ?mini_len <= "95"^^xsd:decimal)
)
?detail lsu:date-type "Date type 1".
{{
  ?detail lsu:retroactive true;
    lsu:end-date rdf:nil .
} UNION {
  ?detail lsu:retroactive false;
    lsu:start-date ?start ;
    lsu:end-date ?end .
  FILTER (?start <= "2006-08-11"^^xsd:date
  && (?end = rdf:nil || ?end >= "2006-08-11"^^xsd:date))
}}
?detail lsu:minimum-age ?min_age;
  lsu:maximum-age ?max_age.
FILTER ((?max_age = rdf:nil || ?max_age >= 8)
 && (?min_age = 0 || ?min_age < 8))
?detail lsu:applicable-for "adfsda" .
?detail lsu:v-type ?v_type.
FILTER (?v_type in (rdf:nil, <http://www.example.org/2015/7/abc>))
?detail lsu:s-type ?s_type.
FILTER (?s_type in (rdf:nil, <http://www.example.org/2015/7/dsfgdsa>))
?detail lsu:max-gg-exclusive ?maxe_gg;
  lsu:max-gg-inclusive ?maxi_gg;
  lsu:min-gg-inclusive ?mine_gg;
  lsu:min-gg-exclusive ?mini_gg.
FILTER (
  (?maxe_gg = rdf:nil || ?maxe_gg > "50"^^xsd:decimal)
  && (?maxi_gg = rdf:nil || ?maxi_gg >= "50"^^xsd:decimal)
  && (?mine_gg = rdf:nil || ?mine_gg < "50"^^xsd:decimal)
  && (?mini_gg = rdf:nil || ?mini_gg <= "50"^^xsd:decimal)
)
?detail lsu:h-m ?h_m.
FILTER (?h_m in (rdf:nil, <http://www.example.org/2015/7/hm1>))
{{
?detail lsu:v-func ?v_func.
FILTER (?v_func in
(<http://www.example.org/2015/7/vf1>,<http://www.example.org/2015/7/vf2>))
} UNION {
?detail lsu:c-n ?c_n.
FILTER (?c_n in
(<http://www.example.org/2015/7/cn1>,<http://www.example.org/2015/7/cn2>,<http://www.example.org/2015/7/cn3>,<http://www.example.org/2015/7/cn4>))
}}
?lsu lsu:lsu-d ?detail.
?lsu lsu:aF ?node.
}
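
The count(*) comparison suggested further down the thread can be scripted so the same two requests are timed repeatedly. The following is a minimal sketch, not a definitive harness: the endpoint URL and dataset name "ds" are assumptions, the function names are hypothetical, and it uses only the Python standard library to send a form-encoded POST as described in the SPARQL 1.1 Protocol.

```python
import time
import urllib.parse
import urllib.request

# Assumption: a local Fuseki server with a dataset named "ds" (hypothetical).
ENDPOINT = "http://localhost:3030/ds/query"


def run_on_fuseki(query, endpoint=ENDPOINT, accept="text/csv"):
    """POST a SPARQL query as a form-encoded body and return the raw response bytes."""
    data = urllib.parse.urlencode({"query": query}).encode("utf-8")
    request = urllib.request.Request(endpoint, data=data, headers={"Accept": accept})
    with urllib.request.urlopen(request) as response:
        # Reading the whole body includes the time spent streaming results.
        return response.read()


def time_query(query, run=run_on_fuseki, repeats=3):
    """Return (best elapsed seconds, response size in bytes) over several runs."""
    best = None
    size = 0
    for _ in range(repeats):
        start = time.perf_counter()
        body = run(query)
        elapsed = time.perf_counter() - start
        if best is None or elapsed < best:
            best = elapsed
        size = len(body)
    return best, size


def compare(full_query, count_query, run=run_on_fuseki):
    """Time the full SELECT against its count(*) variant."""
    full_seconds, full_bytes = time_query(full_query, run)
    count_seconds, _ = time_query(count_query, run)
    return {
        "full_seconds": full_seconds,
        "count_seconds": count_seconds,
        # A small difference suggests query evaluation dominates,
        # not marshaling and serialization of the result set.
        "difference_seconds": full_seconds - count_seconds,
        "full_bytes": full_bytes,
    }
```

Calling compare(full_query, count_query) with the two query strings returns both timings and their difference. Taking the best of several runs is a design choice to reduce noise from cold caches; it does not replace profiling on the server side.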


On Thu, 2016-01-07 at 12:36 +0000, Andy Seaborne wrote:
> It looks like it is the query cost and not the
> 
> > So I conclude we are seeing the best performance possible unless there
> > is something terribly wrong with my queries. They are essentially of the
> > form:
> >
> 
> Details matter here - can you show a real query?
> 
> > select ?s
> > where {
> > ?nd :prop1 <uri1>;
> >   :prop2 "lit1";
> >   :prop3 ?var1;
> >   :prop4 ?var2;
> > # more properties of ?s
> 
> ?s doesn't appear until later.
> 
> There is a chance there are cross products in the real query.
> 
> > filter (?var1 > N1 && ?var1 < N2)
> > filter (?var2 in (<uriA>,<uriB>,...))
> 
> This usually gets optimized - maybe something else in your query is 
> blocking that.
> 
> Filter order can matter as well.
> 
> > #more filters on ?nd properties
> > ?s :p1/:p2 ?nd.
> > }
> >
> > Some of the filters get a little more complicated. And there is at least
> > one, possibly 2, UNION clauses. No OPTIONAL clauses. I've dissected the
> > queries and run each individual piece (triple + filter), and it seems to
> > be the more complicated filters that start to slow things down, as might
> > be expected.
> >
> > Thanks for your comments and interest. The performance we're seeing is
> > unacceptable for our application requirements, so I wanted to see if
> > there were any other performance factors I had missed.
> 
>       Andy
> 
> On 07/01/16 08:48, Håvard Mikkelsen Ottestad wrote:
> > Hi,
> >
> > Reordering the filters might help.
> >
> > Also, maybe a stats file would reorder your query to be faster. I dunno how 
> > often (or if) fuseki generates a stats file. You can try to generate one by 
> > hand when fuseki is shut down: 
> > https://jena.apache.org/documentation/tdb/optimizer.html
> >
> > Also I’m wondering what the performance is like if you take this line away:
> > ?s :p1/:p2 ?nd.
> >
> >
> > One major performance drain I have seen in the past is filters on string 
> > literals. Especially if you are doing anything like CONTAINS or LCASE. 
> > Do you have any of that?
> >
> > Håvard
> >
> >
> >
> >
> > On 07/01/16 03:51, "Paul Tyson" wrote:
> >
> >> On Wed, 2016-01-06 at 18:52 +0000, Andy Seaborne wrote:
> >>> Hi Paul,
> >>>
> >>>   > My question is: is total query time limited by search execution speed,
> >>>   > or by marshaling and serialization of search results?
> >>>
> >>> Costs are a bit of both but normally mainly query.  It also depends on
> >>> the client processing.
> >>>
> >>> Some context please:
> >>> 1/ What's the storage layer?
> >> TDB behind fuseki 2.3.1
> >>
> >>> 2/ What result set format are you getting?
> >> text/csv
> >>
> >>> 3/ How are you handling the results on receipt in the client?
> >> Just writing them to file for testing.
> >>
> >>>
> >>> (Håvard's point about seeing data and query also applies)
> >> Sorry, not easy to share the data.
> >>
> >>>
> >>> The important point is that output is streamed.
> >>>
> >>> Results are sent while the query is executing; it is not the case that the
> >>> query executes, all the results are calculated, and then results are produced.
> >>>
> >>> To investigate, modify the query to do something like this
> >>>
> >>> SELECT (count(*) AS ?C) { ... }
> >>>
> >>> because then the result set cost is low and all the query is executed
> >>> before a result can be produced.
> >>>
> >> Yes, I did that, and the time is very nearly the same.
> >>
> >> So I conclude we are seeing the best performance possible unless there
> >> is something terribly wrong with my queries. They are essentially of the
> >> form:
> >>
> >> select ?s
> >> where {
> >> ?nd :prop1 <uri1>;
> >>   :prop2 "lit1";
> >>   :prop3 ?var1;
> >>   :prop4 ?var2;
> >> # more properties of ?s
> >> filter (?var1 > N1 && ?var1 < N2)
> >> filter (?var2 in (<uriA>,<uriB>,...))
> >> #more filters on ?nd properties
> >> ?s :p1/:p2 ?nd.
> >> }
> >>
> >> Some of the filters get a little more complicated. And there is at least
> >> one, possibly 2, UNION clauses. No OPTIONAL clauses. I've dissected the
> >> queries and run each individual piece (triple + filter), and it seems to
> >> be the more complicated filters that start to slow things down, as might
> >> be expected.
> >>
> >> Thanks for your comments and interest. The performance we're seeing is
> >> unacceptable for our application requirements, so I wanted to see if
> >> there were any other performance factors I had missed.
> >>
> >> Regards,
> >> --Paul
> >>
> >>>      Andy
> >>>
> >>>
> >>> On 06/01/16 16:17, Paul Tyson wrote:
> >>>> I have a modest (17M triple) dataset, fairly flat graph. I run some
> >>>> queries selecting nodes with anywhere from 12-20 different property
> >>>> values.
> >>>>
> >>>> Result set counts are anywhere from 10,000 to 30,000 nodes. Total
> >>>> execution times measured at the client are in the 30-40 second range.
> >>>>
> >>>> The web request begins streaming results immediately, but seems to take
> >>>> longer than it should (based on the number of results and size of data
> >>>> transfer). I also notice that the time is roughly linear with the size
> >>>> of dataset--halving the dataset size halves the result set and the
> >>>> execution time. I wouldn't have expected this behavior if all the time
> >>>> was due to an indexed search.
> >>>>
> >>>> My question is: is total query time limited by search execution speed,
> >>>> or by marshaling and serialization of search results?
> >>>>
> >>>> I have tried different query patterns, and believe I have the best
> >>>> queries possible for the use case.
> >>>>
> >>>> I'm looking for other suggestions to reduce overall execution time. The
> >>>> performance does not improve drastically going from 4 GB to 8 or 16 GB
> >>>> RAM. My test platforms are 64-bit Windows, ranging from a small server
> >>>> (16 GB RAM, 4 CPUs) to laptops with 4 GB RAM.
> >>>>
> >>>> Thanks,
> >>>> --Paul
> >>>>
> >>>
> >>
> >>
>
