Re: Jena / Fuseki / SPARQL performance (new to the tech)

Martin Van Aken Thu, 20 May 2021 02:26:31 -0700

Hi Martynas,
Thanks a lot - that was exactly what I was wondering, as indeed a lot of my
variables are just there to make sure I have them in the output but are
probably extending the search space a lot for no good reason. I did not know
I could wrap a query in a DESCRIBE, thus splitting the "what I want to see"
part from the "what I'm searching on" part. Going to try this ASAP.
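
A minimal sketch of that split, reusing the prefixes and properties from the
queries quoted below and not run against the real data. The inner SELECT only
narrows down ?paper; the outer DESCRIBE fetches a description of each match:

```sparql
PREFIX rdf:      <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX iospress: <http://ld.iospress.nl/rdf/ontology/>

# Outer query: "what I want to see" (a description of each selected paper).
DESCRIBE ?paper
WHERE {
  # Inner query: "what I'm searching on" (only the variables used to filter).
  { SELECT DISTINCT ?paper ?pubDate
    WHERE {
      { ?paper rdf:type iospress:Chapter } UNION { ?paper rdf:type iospress:Article }
      ?paper iospress:publicationDate ?pubDate ;
             iospress:publicationIncludesKeyword ?keyword .
      FILTER regex(?keyword, "sickness", "i")
    }
    ORDER BY ?pubDate ?paper
    LIMIT 50
  }
}
```

What DESCRIBE returns for each ?paper is decided by the SPARQL service, so the
amount of data per paper may differ between stores.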


Thanks!

Martin

On Thu, 20 May 2021 at 11:12, Martynas Jusevičius <marty...@atomgraph.com>
wrote:

> Martin,
>
> Some of the OPTIONAL variables don't seem to be used anywhere else in the
> query.
>
> Rather than using SELECT to pull the data fields, can't you use it to
> only filter down the entities of interest, and wrap the whole thing
> into a DESCRIBE to retrieve their full descriptions as graphs?
> Something like:
>
> PREFIX  rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
> PREFIX  iospress-dt: <http://ld.iospress.nl/rdf/datatype/>
> PREFIX  xsd:  <http://www.w3.org/2001/XMLSchema#>
> PREFIX  rdfs: <http://www.w3.org/2000/01/rdf-schema#>
> PREFIX  iospress: <http://ld.iospress.nl/rdf/ontology/>
> PREFIX  iospress-geocode: <http://ld.iospress.nl/rdf/geocode/>
>
> DESCRIBE *
> WHERE
>   { SELECT  ?paper ?author ?issueOrBook ?access ?journal
>     WHERE
>       {   { ?paper  rdf:type  iospress:Chapter }
>         UNION
>           { ?paper  rdf:type  iospress:Article }
>         ?paper    iospress:publicationDate  ?pubDate ;
>                  iospress:publicationIncludesKeyword ?keyword;
>                  iospress:publicationAuthorList  [ ?idx ?author ] .
>         ?issueOrBook  iospress:partOf   ?volumeOrSerie .
>         ?paper    iospress:partOf       ?issueOrBook
>         OPTIONAL
>           { ?paper  iospress:publicationAccessibility  ?access }
>         OPTIONAL
>           { ?volumeOrSerie
>                       iospress:partOf  ?journal
>           }
>         FILTER ( ( ( datatype(?pubDate) = xsd:date
>                      && xsd:dateTime(?pubDate) > "1999-12-31T23:00:00.000Z"^^xsd:dateTime
>                      && xsd:dateTime(?pubDate) < "2021-05-18T12:16:58.841Z"^^xsd:dateTime )
>                    || ( datatype(?pubDate) = xsd:gYear
>                         && ?pubDate >= "2000"^^xsd:gYear
>                         && ?pubDate <= "2021"^^xsd:gYear ) )
>                  && regex(?keyword, "sickness", "i") )
>       }
>     ORDER BY ?pubDate ?paper
>     LIMIT   50
>   }
>
> On Thu, May 20, 2021 at 10:44 AM Martin Van Aken
> <mar...@joyouscoding.com> wrote:
> >
> > Andy,
> > A big thanks for this - it gives me some paths to explore. I think my
> > biggest problems are indeed in the optional parts - I'll run the tests you
> > advised and also look at which cases let me get rid of the optionals, to
> > avoid the situations that could lead to a large number of results, as you
> > mentioned. I'm already looking at moving my filters closer to the patterns
> > that define their variables - can this be done for things other than pure
> > equality (for example, for the dates that are tested against a range)?
> >
> > Maybe one question about optional - I use them in some cases to avoid
> > empty results. An example is Access - some papers have an Access triple
> > (Open or Closed) - but some have none. My understanding is that if I add a
> > pattern without optional, like:
> >
> > ?paper iospress:accessibility ?access
> >
> > this will de facto remove all papers without access from the set. This is
> > something I don't want (I want them in the list, just with an empty value
> > there) - and my understanding is that the way to manage this is an
> > Optional. Is this correct? Is there a "better" way? If this ends up being
> > costly, I could also check to actually have a value for those (those
> > without value are technically "Closed").
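> >
> > For illustration, a sketch of keeping such papers and filling in a default,
> > assuming the publicationAccessibility property from the full query and a
> > plain "Closed" literal as the fallback (the real access values may well be
> > IRIs instead). The OPTIONAL stays; COALESCE only supplies the missing value:
> >
> > ```sparql
> > PREFIX iospress: <http://ld.iospress.nl/rdf/ontology/>
> >
> > SELECT ?paper ?accessOrDefault
> > WHERE {
> >   ?paper iospress:publicationDate ?pubDate .
> >   # Papers without an access triple are kept, with ?access left unbound.
> >   OPTIONAL { ?paper iospress:publicationAccessibility ?access }
> >   # Hypothetical default: "Closed" stands in for a missing value.
> >   BIND (COALESCE(?access, "Closed") AS ?accessOrDefault)
> > }
> > ```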
> >
> > Something I was wondering also is whether it makes sense to split the
> > fields I need for search/filtering from the ones I want to see in the
> > result. I have a feeling that in theory I could use two queries - one with
> > only the params I need for the filtering, then something similar to
> > DESCRIBE on each record of the filtered set - but I've no idea if this
> > would be more performant than keeping it together as it is now.
> >
> > Anyway, the exchanges here are much appreciated!
> >
> > On Tue, 18 May 2021 at 19:18, Andy Seaborne <a...@apache.org> wrote:
> >
> > > Martin,
> > >
> > > > That's a complicated query and I haven't got my head around it
> > > > completely yet.
> > >
> > > There are some useful points to understand:
> > >
> > > A::
> > >
> > > What is the time and outcome of these queries that focus on the main
> > > data location part:
> > >
> > > 1/
> > >
> > > SELECT (count(*) AS ?C) {
> > >   ?paper  iospress:publicationDate ?pubDate
> > >   FILTER(...date test...)
> > > }
> > >
> > > 2/
> > >   SELECT (count(*) AS ?C) {
> > >   ?paper  iospress:publicationDate ?pubDate
> > >           iospress:publicationIncludesKeyword ?keyword .
> > >   FILETER (...date... && (regex (?keyword, "sickness", "i"))
> > >
> > > 3/
> > > SELECT (count(*) AS ?C) {
> > >    {?paper rdf:type iospress:Chapter.}
> > >              union
> > >    {?paper rdf:type iospress:Article.}
> > >    ?paper  iospress:publicationDate ?pubDate
> > > >    FILTER(...date test...)
> > > }
> > >
> > > 4/
> > > SELECT (count(*) AS ?C) {
> > >   ?paper  iospress:publicationDate ?pubDate
> > > >   FILTER(...date test...)
> > >    {?paper rdf:type iospress:Chapter.}
> > >              union
> > >    {?paper rdf:type iospress:Article.}
> > > }
> > >
> > > B::
> > >
> > > > Then, is it the case that some optionals have more effect than others?
> > > > Some are "high risk":
> > >
> > > ---
> > >      OPTIONAL {
> > >          ?author iospress:contributorAffiliation ?affiliation.
> > >          ?affiliation rdfs:label ?university;
> > >      }
> > >       OPTIONAL {
> > >        ?affiliation iospress:geocodingOutput ?geocoded.
> > >        ?geocoded iospress-geocode:country ?country
> > >      }
> > > ---
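> > > > Suppose the first does not match: ?affiliation is then unbound, so the
> > > > second matches every affiliation that has geocoding output, which is a
> > > > lot of results unrelated to ?paper.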
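> > > >
> > > > A sketch of one possible way around this, assuming the country is only
> > > > wanted when an affiliation was found: nest the second OPTIONAL inside
> > > > the first, so it is only evaluated with ?affiliation already bound. The
> > > > SELECT list and the leading author pattern are cut down for illustration:
> > > >
> > > > ```sparql
> > > > PREFIX rdfs:             <http://www.w3.org/2000/01/rdf-schema#>
> > > > PREFIX iospress:         <http://ld.iospress.nl/rdf/ontology/>
> > > > PREFIX iospress-geocode: <http://ld.iospress.nl/rdf/geocode/>
> > > >
> > > > SELECT ?paper ?author ?university ?country
> > > > WHERE {
> > > >   ?paper iospress:publicationAuthorList [ ?idx ?author ] .
> > > >   OPTIONAL {
> > > >     ?author iospress:contributorAffiliation ?affiliation .
> > > >     ?affiliation rdfs:label ?university .
> > > >     # Nested: only reached when ?affiliation was bound just above.
> > > >     OPTIONAL {
> > > >       ?affiliation iospress:geocodingOutput ?geocoded .
> > > >       ?geocoded iospress-geocode:country ?country
> > > >     }
> > > >   }
> > > > }
> > > > ```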
> > >
> > > C::
> > >
> > > distinct
> > >
> > > it might be worth trying without distinct because distinct can cause a
> > > lot of results to be reduced to just a few, hiding redundant work.
> > >
> > >      Andy
> > >
> > > On 18/05/2021 13:31, Martin Van Aken wrote:
> > > > Hello again,
> > > > > After some more days of trying to get better performance & the
> > > > > approval of my company, here is what we try to run (query at the
> > > > > bottom of the mail).
> > > >
> > > > For some context:
> > > >
> > > > > - This is a search for academic papers. Papers have multiple authors,
> > > > > and authors are part of multiple universities. Papers also have
> > > > > multiple keywords and are generally part of a set (an issue), itself
> > > > > part of a set (a volume), itself part of a set (a journal).
> > > > > - Our goal is to have a multicriteria search front end, so the query
> > > > > is generated from a form with clauses selected by the user. The
> > > > > structure is always the same; this example uses a single condition on
> > > > > the "keyword".
> > > > > - The set of data is relatively small - around 150k papers (so
> > > > > probably 1M triples there), probably around 500k authors.
> > > > > - We use group/concat as we want to give as results one line per paper
> > > > > (vs having one per paper per keyword, for example).
> > > > > - I've read OPTIONALs are pretty bad - but I've no real alternative
> > > > > here that I know of when some fields can be present or not and I don't
> > > > > want to throw away all papers that miss at least one.
> > > >
> > > > > For our current results, all but the most precise queries (getting
> > > > > into a super limited set of papers, like <10) get extremely slow
> > > > > (easily dozens of seconds, sometimes more). I feel that there is
> > > > > something obvious that I'm missing, either in the query or my Jena
> > > > > config. The server is on an old version but I make my tests locally on
> > > > > a 4.0.0 "out of the box" (0 configuration).
> > > >
> > > > What I've tried:
> > > >
> > > > > - Removing the ORDER does not impact much
> > > > > - Removing most optionals works... but removes the point of the query
> > > > > - Using contains instead of regex does not impact much (my goal is to
> > > > > use the Jena/Lucene integration for everything text related)
> > > >
> > > > > I'm really looking for an opinion, as from my RDBMS background this is
> > > > > the equivalent of less than 3M records split over around 8 tables -
> > > > > something that should be queryable mostly in sub-second times.
> > > >
> > > > Any feedback is most welcome !
> > > >
> > > > Martin
> > > >
> > > > PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
> > > >      PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
> > > >      PREFIX iospress: <http://ld.iospress.nl/rdf/ontology/>
> > > >      PREFIX iospress-geocode: <http://ld.iospress.nl/rdf/geocode/>
> > > >      PREFIX iospress-dt: <http://ld.iospress.nl/rdf/datatype/>
> > > >      PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
> > > >
> > > > >      SELECT ?type ?pubDate ?paper ?doi ?title ?abstract ?access
> > > > >          (group_concat(distinct ?authorName;separator=", ") as ?Authors)
> > > > >          (group_concat(distinct ?keyword;separator=", ") as ?keywords)
> > > > >          (group_concat(distinct ?university;separator=", ") as ?universities)
> > > > >          (group_concat(distinct ?country;separator=", ") as ?countries)
> > > >      WHERE {
> > > >          {?paper rdf:type iospress:Chapter.}
> > > >              union
> > > >          {?paper rdf:type iospress:Article.}
> > > >
> > > >          ?paper rdfs:label ?title;
> > > >                   rdf:type ?type;
> > > >
> > > >                   iospress:publicationDate ?pubDate;
> > > >                   iospress:publicationAbstract ?abstract;
> > > >
> > > >                   iospress:publicationIncludesKeyword ?keyword;
> > > >                   iospress:publicationAuthorList [?idx ?author].
> > > >
> > > >          ?issueOrBook iospress:partOf ?volumeOrSerie.
> > > >          ?paper iospress:partOf ?issueOrBook.
> > > >
> > > >
> > > >      OPTIONAL {
> > > >          ?issueOrBook iospress:isbn ?bookIsbn.
> > > >      }
> > > >      OPTIONAL {
> > > >          ?paper iospress:publicationDoiUrl ?doi.
> > > >      }
> > > >      OPTIONAL {
> > > >          ?author rdfs:label ?authorName.
> > > >      }
> > > >      OPTIONAL {
> > > >          ?author iospress:contributorAffiliation ?affiliation.
> > > >          ?affiliation rdfs:label ?university;
> > > >      }
> > > >       OPTIONAL {
> > > >        ?affiliation iospress:geocodingOutput ?geocoded.
> > > >        ?geocoded iospress-geocode:country ?country
> > > >      }
> > > >      OPTIONAL {
> > > >          ?paper iospress:publicationAccessibility ?access.
> > > >      }
> > > >      OPTIONAL {
> > > >          ?volumeOrSerie iospress:partOf ?journal;
> > > >      }
> > > >      FILTER(
> > > >          (
> > > > >              (datatype(?pubDate) = xsd:date &&
> > > > >               xsd:dateTime(?pubDate) > "1999-12-31T23:00:00.000Z"^^xsd:dateTime &&
> > > > >               xsd:dateTime(?pubDate) < "2021-05-18T12:16:58.841Z"^^xsd:dateTime ) ||
> > > > >              (datatype(?pubDate) = xsd:gYear &&
> > > > >               ?pubDate >= "2000"^^xsd:gYear && ?pubDate <= "2021"^^xsd:gYear)
> > > >          )
> > > >
> > > >          && (regex (?keyword, "sickness", "i"))
> > > >          )
> > > >      }
> > > >      GROUP BY ?type ?abstract ?pubDate ?paper ?doi ?title ?access
> > > >
> > > >      ORDER BY ?pubDate ?paper
> > > >      LIMIT 50
> > > >
> > > >
> > > > On Thu, 6 May 2021 at 20:10, Andy Seaborne <a...@apache.org> wrote:
> > > >
> > > >> Hi there,
> > > >>
> > > >> Showing the query would be helpful but some general remarks:
> > > >>
> > > > >> 1/ If the query or the setup for Fuseki needs more than the default
> > > > >> heap size, then it might be that the JVM is getting into a state of
> > > > >> heap exhaustion. This manifests as the CPU load getting very high. It
> > > > >> will seem like nothing is happening (waiting for response).
> > > >>
> > > >> 2/ The query may be expensive.
> > > >>
> > > >> Things to look for
> > > >> * cross products - two parts of the query pattern that are not
> > > >> connected.
> > > >>
> > > > >> { ?s ?p ?o . ?a ?b ?c } is N-squared the size of the database.
> > > > >>
> > > > >> * sort - spilling to disk, or combined with a cross product in the
> > > > >> query.
> > > > >>
> > > > >> 3/ If no results are coming back, then the query is of a form that
> > > > >> does not stream back - sort, or CONSTRUCT maybe.
> > > >>
> > > > >> There was a useful presentation recently that talks about the
> > > > >> principles of query efficiency.
> > > >>
> > > >> SPARQL Query Optimization with Pavel Klinov
> > > >> https://www.youtube.com/watch?v=16eMswT2x2Y
> > > >>
> > > >> More inline:
> > > >>
> > > >> On 06/05/2021 09:54, Martin Van Aken wrote:
> > > >>> Hi!
> > > > >>> I'm Martin, I'm a software developer new to the Triples/SPARQL world.
> > > > >>> I'm currently building queries against a Fuseki/TDB backend (that I
> > > > >>> can work on too) and I'm getting into significant performance
> > > > >>> problems (including never-ending queries).
> > > >>
> > > >> Are updates also happening at the same time?
> > > >>
> > > >>> Despite what I thought was a good search on the apache
> > > >>> jena website I could not find a lot of insight about performance
> > > >>> investigation so I'm trying it here.
> > > >>>
> > > > >>> Most of my data experience comes from the relational world (ex: PG)
> > > > >>> so I'm sometimes drawing comparisons there.
> > > > >>>
> > > > >>> To give some context, my data set is around 15 linked concepts, with
> > > > >>> the number of triples for each ranging from some hundreds to 500K -
> > > > >>> less than 2 million in total (documents/authors/publication kind of
> > > > >>> data).
> > > >>>
> > > > >>> On to questions:
> > > > >>>
> > > > >>>      - When I'm facing a slow query, what are my investigation
> > > > >>>      options? Is there an equivalent of an "explain plan" in SQL,
> > > > >>>      pointing to the query-specific slow points? What's the advised
> > > > >>>      way to do performance checks in SPARQL?
> > > >>
> > > >> qparse --print=opt --file query.rq
> > > >>
> > > > >>>      - Are there any performance setups to be aware of on the server
> > > > >>>      side? Like ways to check indexes are correctly built (outside of
> > > > >>>      text search, which I'm not working with for the moment)
> > > > >>>      - We're currently using TDB1. I've seen the transactional
> > > > >>>      benefits of TDB2 - are there performance improvements too that
> > > > >>>      would warrant a migration there?
> > > >>
> > > >> Not on the query side.
> > > >>
> > > >>       Andy
> > > >>
> > > >>>
> > > >>> Thanks a lot already!
> > > >>>
> > > >>> Martin
> > > >>>
> > > >>
> > > >
> > > >
> > >
> >
> >
> > --
> > *Martin Van Aken - **Freelance Enthusiast Developer*
> >
> > Mobile : +32 486 899 652
> >
> > Follow me on Twitter : @martinvanaken <http://twitter.com/martinvanaken>
> > Call me on Skype : vanakenm
> > Hang out with me : mar...@joyouscoding.com
> > Contact me on LinkedIn : http://www.linkedin.com/in/martinvanaken
> > Company website : www.joyouscoding.com
>


