Hi Martynas,

Thanks a lot - that was exactly what I was wondering, as indeed a lot of my variables are just there to make sure I have them in the output but are probably extending the search space a lot for no good reason. I did not know I could wrap a query in a DESCRIBE, thus splitting the "what I want to see" part from the "what I'm searching on" part. Going to try this ASAP.
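For comparison, the same split can also be sketched with a plain SELECT subquery instead of DESCRIBE. This is only an illustration that reuses the predicates from the query quoted further down; the date test is left out for brevity, the choice of output fields is an assumption, and whether it is actually faster on this dataset has not been measured.

```sparql
PREFIX rdf:      <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:     <http://www.w3.org/2000/01/rdf-schema#>
PREFIX iospress: <http://ld.iospress.nl/rdf/ontology/>

SELECT ?paper ?pubDate ?title ?doi ?access
WHERE {
  # "What I'm searching on": only the patterns needed to pick the papers.
  {
    SELECT DISTINCT ?paper ?pubDate
    WHERE {
      { ?paper rdf:type iospress:Chapter } UNION { ?paper rdf:type iospress:Article }
      ?paper iospress:publicationDate ?pubDate ;
             iospress:publicationIncludesKeyword ?keyword .
      FILTER regex(?keyword, "sickness", "i")
    }
    ORDER BY ?pubDate ?paper
    LIMIT 50
  }
  # "What I want to see": display-only fields, looked up for at most 50 rows.
  ?paper rdfs:label ?title .
  OPTIONAL { ?paper iospress:publicationDoiUrl ?doi }
  OPTIONAL { ?paper iospress:publicationAccessibility ?access }
}
# The order of subquery results is not guaranteed to survive the outer join.
ORDER BY ?pubDate ?paper
```

Because the label pattern now sits outside the subquery, a paper without an rdfs:label is dropped after the LIMIT has been applied, so fewer than 50 rows can come back.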
Thanks!

Martin

On Thu, 20 May 2021 at 11:12, Martynas Jusevičius <marty...@atomgraph.com> wrote:

> Martin,
>
> Some of the OPTIONAL variables don't seem to be used anywhere else in the query.
>
> Rather than using SELECT to pull the data fields, can't you use it to only filter down the entities of interest, and wrap the whole thing into a DESCRIBE to retrieve their full descriptions as graphs? Something like:
>
> PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
> PREFIX iospress-dt: <http://ld.iospress.nl/rdf/datatype/>
> PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
> PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
> PREFIX iospress: <http://ld.iospress.nl/rdf/ontology/>
> PREFIX iospress-geocode: <http://ld.iospress.nl/rdf/geocode/>
>
> DESCRIBE *
> WHERE
>   { SELECT ?paper ?author ?issueOrBook ?access ?journal
>     WHERE
>       {   { ?paper rdf:type iospress:Chapter }
>         UNION
>           { ?paper rdf:type iospress:Article }
>         ?paper iospress:publicationDate ?pubDate ;
>                iospress:publicationIncludesKeyword ?keyword ;
>                iospress:publicationAuthorList [ ?idx ?author ] .
>         ?issueOrBook iospress:partOf ?volumeOrSerie .
>         ?paper iospress:partOf ?issueOrBook
>         OPTIONAL
>           { ?paper iospress:publicationAccessibility ?access }
>         OPTIONAL
>           { ?volumeOrSerie iospress:partOf ?journal }
>         FILTER ( ( ( ( ( datatype(?pubDate) = xsd:date )
>                        && ( xsd:dateTime(?pubDate) > "1999-12-31T23:00:00.000Z"^^xsd:dateTime ) )
>                      && ( xsd:dateTime(?pubDate) < "2021-05-18T12:16:58.841Z"^^xsd:dateTime ) )
>                    || ( ( ( datatype(?pubDate) = xsd:gYear )
>                           && ( ?pubDate >= "2000"^^xsd:gYear ) )
>                         && ( ?pubDate <= "2021"^^xsd:gYear ) ) )
>                  && regex(?keyword, "sickness", "i") )
>       }
>     ORDER BY ?pubDate ?paper
>     LIMIT 50
>   }
>
> On Thu, May 20, 2021 at 10:44 AM Martin Van Aken <mar...@joyouscoding.com> wrote:
> >
> > Andy,
> > A big thanks for this - it gives me some paths to explore.
> > I think indeed my biggest problems are in the optional parts - I'll run the tests you advised and also look at which cases would let me get rid of the optionals, to avoid those situations that could lead to a big amount of results as you mentioned. I'm already looking at getting my filters closer to the definition - can this be done for things other than pure equality (for example for the dates, which are tested against a range)?
> >
> > Maybe one question about optional - I use them in some cases to avoid empty results. An example is Access - some papers have an Access triple (Open or Closed) - but some have none. My understanding is that if I make a link without optional like:
> >
> > ?paper iospress:accessibility ?access
> >
> > this will de facto remove all papers without access from the set. This is something I don't want (I want them in the list, just with an empty value there) - and my understanding is that the way to manage this is an Optional. Is this correct? Is there a "better" way? If this ends up being costly, I could also make sure there is actually a value for those (those without value are technically "Closed").
> >
> > Something I was wondering also is whether it makes sense to split the fields I need for search/filtering from the ones I want to see in the result. I've a feeling that in theory I could play with two queries - one with only the params I need for the filtering, then run something similar to DESCRIBE on each record of the filtered set - but I've no idea if this would be more performant than keeping it together as it is now.
> >
> > Anyway, the exchanges here are much appreciated!
> >
> > On Tue, 18 May 2021 at 19:18, Andy Seaborne <a...@apache.org> wrote:
> >
> > > Martin,
> > >
> > > That's a complicated query and I haven't got my head around it completely yet.
> > >
> > > There are some useful points to understand:
> > >
> > > A::
> > >
> > > What is the time and outcome of these queries that focus on the main data location part:
> > >
> > > 1/
> > >
> > > SELECT (count(*) AS ?C) {
> > >     ?paper iospress:publicationDate ?pubDate
> > >     FILTER(...date test...)
> > > }
> > >
> > > 2/
> > > SELECT (count(*) AS ?C) {
> > >     ?paper iospress:publicationDate ?pubDate ;
> > >            iospress:publicationIncludesKeyword ?keyword .
> > >     FILTER (...date... && (regex (?keyword, "sickness", "i")))
> > > }
> > >
> > > 3/
> > > SELECT (count(*) AS ?C) {
> > >     {?paper rdf:type iospress:Chapter.}
> > >     union
> > >     {?paper rdf:type iospress:Article.}
> > >     ?paper iospress:publicationDate ?pubDate
> > >     FILTER(...date test...)
> > > }
> > >
> > > 4/
> > > SELECT (count(*) AS ?C) {
> > >     ?paper iospress:publicationDate ?pubDate
> > >     FILTER(...date test...)
> > >     {?paper rdf:type iospress:Chapter.}
> > >     union
> > >     {?paper rdf:type iospress:Article.}
> > > }
> > >
> > > B::
> > >
> > > Then, is it the case that some optionals have more effect than others? Some are "high risk":
> > >
> > > ---
> > > OPTIONAL {
> > >     ?author iospress:contributorAffiliation ?affiliation.
> > >     ?affiliation rdfs:label ?university;
> > > }
> > > OPTIONAL {
> > >     ?affiliation iospress:geocodingOutput ?geocoded.
> > >     ?geocoded iospress-geocode:country ?country
> > > }
> > > ---
> > > Suppose the first does not match: then the second is a lot of results unrelated to ?paper.
> > >
> > > C::
> > >
> > > distinct
> > >
> > > It might be worth trying without distinct, because distinct can cause a lot of results to be reduced to just a few, hiding redundant work.
> > >
> > > Andy
> > >
> > > On 18/05/2021 13:31, Martin Van Aken wrote:
> > > > Hello again,
> > > > After some more days of me trying to get better performance & the approval of my company, here is what we try to run (query at the bottom of the mail).
> > > >
> > > > For some context:
> > > >
> > > > - This is a search for academic papers. Papers have multiple authors, and authors are part of multiple universities. Papers also have multiple keywords and are generally part of a set (an issue), itself part of a set (a volume), itself part of a set (a journal).
> > > > - Our goal is to have a multicriteria search front end, so the query is generated from a form with clauses selected by the user. The structure is always the same; this example uses a single condition on the "keyword".
> > > > - The set of data is relatively small - around 150k papers (so probably 1M triples there), probably around 500k authors.
> > > > - We use group/concat as we want to give as results one line per paper (vs having one per paper per keyword, for example).
> > > > - I've read OPTIONALs are pretty bad - but I've no real alternative here that I know of when some fields can be present or not and I don't want to throw away all papers that miss at least one.
> > > >
> > > > For our current results, all but the most precise queries (getting into a super limited set of papers, like <10) get extremely slow (easily dozens of seconds, sometimes more). I feel that there is something obvious that I'm missing, either in the query or my Jena config. The server is on an old version but I run my tests locally on a 4.0.0 "out of the box" (0 configuration).
> > > >
> > > > What I've tried:
> > > >
> > > > - Removing the ORDER does not impact much
> > > > - Removing most optionals works...
> > > >   but removes the point of the query
> > > > - Using contains instead of regex does not impact much (my goal is to use the Jena/Lucene integration for everything text related)
> > > >
> > > > I'm really in for an opinion, as with my RDBMS background this is the equivalent of less than 3M records split over around 8 tables - something that should be queryable mostly in sub-second times.
> > > >
> > > > Any feedback is most welcome!
> > > >
> > > > Martin
> > > >
> > > > PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
> > > > PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
> > > > PREFIX iospress: <http://ld.iospress.nl/rdf/ontology/>
> > > > PREFIX iospress-geocode: <http://ld.iospress.nl/rdf/geocode/>
> > > > PREFIX iospress-dt: <http://ld.iospress.nl/rdf/datatype/>
> > > > PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
> > > >
> > > > SELECT ?type ?pubDate ?paper ?doi ?title ?abstract ?access
> > > >     (group_concat(distinct ?authorName;separator=", ") as ?Authors)
> > > >     (group_concat(distinct ?keyword;separator=", ") as ?keywords)
> > > >     (group_concat(distinct ?university;separator=", ") as ?universities)
> > > >     (group_concat(distinct ?country;separator=", ") as ?countries)
> > > > WHERE {
> > > >     {?paper rdf:type iospress:Chapter.}
> > > >     union
> > > >     {?paper rdf:type iospress:Article.}
> > > >
> > > >     ?paper rdfs:label ?title;
> > > >         rdf:type ?type;
> > > >
> > > >         iospress:publicationDate ?pubDate;
> > > >         iospress:publicationAbstract ?abstract;
> > > >
> > > >         iospress:publicationIncludesKeyword ?keyword;
> > > >         iospress:publicationAuthorList [?idx ?author].
> > > >
> > > >     ?issueOrBook iospress:partOf ?volumeOrSerie.
> > > >     ?paper iospress:partOf ?issueOrBook.
> > > >
> > > >     OPTIONAL {
> > > >         ?issueOrBook iospress:isbn ?bookIsbn.
> > > >     }
> > > >     OPTIONAL {
> > > >         ?paper iospress:publicationDoiUrl ?doi.
> > > >     }
> > > >     OPTIONAL {
> > > >         ?author rdfs:label ?authorName.
> > > >     }
> > > >     OPTIONAL {
> > > >         ?author iospress:contributorAffiliation ?affiliation.
> > > >         ?affiliation rdfs:label ?university;
> > > >     }
> > > >     OPTIONAL {
> > > >         ?affiliation iospress:geocodingOutput ?geocoded.
> > > >         ?geocoded iospress-geocode:country ?country
> > > >     }
> > > >     OPTIONAL {
> > > >         ?paper iospress:publicationAccessibility ?access.
> > > >     }
> > > >     OPTIONAL {
> > > >         ?volumeOrSerie iospress:partOf ?journal;
> > > >     }
> > > >     FILTER(
> > > >         (
> > > >             (datatype(?pubDate) = xsd:date && xsd:dateTime(?pubDate) > "1999-12-31T23:00:00.000Z"^^xsd:dateTime && xsd:dateTime(?pubDate) < "2021-05-18T12:16:58.841Z"^^xsd:dateTime ) ||
> > > >             (datatype(?pubDate) = xsd:gYear && ?pubDate >= "2000"^^xsd:gYear && ?pubDate <= "2021"^^xsd:gYear)
> > > >         )
> > > >
> > > >         && (regex (?keyword, "sickness", "i"))
> > > >     )
> > > > }
> > > > GROUP BY ?type ?abstract ?pubDate ?paper ?doi ?title ?access
> > > >
> > > > ORDER BY ?pubDate ?paper
> > > > LIMIT 50
> > > >
> > > >
> > > > On Thu, 6 May 2021 at 20:10, Andy Seaborne <a...@apache.org> wrote:
> > > >
> > > >> Hi there,
> > > >>
> > > >> Showing the query would be helpful but some general remarks:
> > > >>
> > > >> 1/ If the query or the setup for Fuseki needs more than the default heap size, then it might be that the JVM is getting into a state of heap exhaustion. This manifests as the CPU load getting very high. It will seem like nothing is happening (waiting for response).
> > > >>
> > > >> 2/ The query may be expensive.
> > > >>
> > > >> Things to look for:
> > > >> * cross products - two parts of the query pattern that are not connected.
> > > >>
> > > >>     { ?s ?p ?o . ?a ?b ?c } is N-squared the size of the database.
> > > >>
> > > >> * sort, spilling to disk or combined with a cross product in the query.
> > > >>
> > > >> 3/ If no results are coming back, then the query is of a form that does not stream back - sort, or CONSTRUCT maybe.
> > > >>
> > > >> There was a useful presentation recently that talks about the principles of query efficiency.
> > > >>
> > > >> SPARQL Query Optimization with Pavel Klinov
> > > >> https://www.youtube.com/watch?v=16eMswT2x2Y
> > > >>
> > > >> More inline:
> > > >>
> > > >> On 06/05/2021 09:54, Martin Van Aken wrote:
> > > >>> Hi!
> > > >>> I'm Martin, I'm a software developer new to the Triples/SPARQL world. I'm currently building queries against a Fuseki/TDB backend (that I can work on too) and I'm getting into significant performance problems (including never-ending queries).
> > > >>
> > > >> Are updates also happening at the same time?
> > > >>
> > > >>> Despite what I thought was a good search on the Apache Jena website, I could not find a lot of insight about performance investigation, so I'm trying it here.
> > > >>>
> > > >>> Most of my data experience comes from the relational world (ex: PG) so I'm sometimes drawing comparisons there.
> > > >>>
> > > >>> To give some context, my data set is around 15 linked concepts, with the number of triples for each ranging from some hundreds to 500K - total less than 2 million (documents/authors/publication kind of data).
> > > >>>
> > > >>> Onto questions:
> > > >>>
> > > >>> - When I'm facing a slow query, what are my investigation options? Is there an equivalent of an "explain plan" in SQL pointing to the query-specific slow points? What's the advised way for performance checks in SPARQL?
> > > >>
> > > >> qparse --print=opt --file query.rq
> > > >>
> > > >>> - Are there any performance setups to be aware of on the server side?
> > > >>>   Like ways to check indexes are correctly built (outside of text search that I'm not working with for the moment)
> > > >>> - We're currently using TDB1. I've seen the transactional benefits of TDB2 - are there performance improvements too that would warrant a migration there?
> > > >>
> > > >> Not on the query side.
> > > >>
> > > >> Andy
> > > >>
> > > >>>
> > > >>> Thanks a lot already!
> > > >>>
> > > >>> Martin
> > > >>>
> >
> > --
> > *Martin Van Aken - **Freelance Enthusiast Developer*
> >
> > Mobile : +32 486 899 652
> >
> > Follow me on Twitter : @martinvanaken <http://twitter.com/martinvanaken>
> > Call me on Skype : vanakenm
> > Hang out with me : mar...@joyouscoding.com
> > Contact me on LinkedIn : http://www.linkedin.com/in/martinvanaken
> > Company website : www.joyouscoding.com
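
Point B in Andy's reply of 18 May (the two affiliation OPTIONALs) can be illustrated with a small sketch. It is not taken from the thread: it assumes that a country is only wanted for an affiliation that the first OPTIONAL actually matched, and it nests the second OPTIONAL inside the first so that the geocoding lookup never runs with ?affiliation unbound.

```sparql
PREFIX rdfs:             <http://www.w3.org/2000/01/rdf-schema#>
PREFIX iospress:         <http://ld.iospress.nl/rdf/ontology/>
PREFIX iospress-geocode: <http://ld.iospress.nl/rdf/geocode/>

SELECT ?paper ?author ?university ?country
WHERE {
  ?paper iospress:publicationAuthorList [ ?idx ?author ] .
  OPTIONAL {
    ?author iospress:contributorAffiliation ?affiliation .
    ?affiliation rdfs:label ?university .
    # Nested: evaluated only for the ?affiliation bound just above,
    # so a missing affiliation cannot join against every geocoded one.
    OPTIONAL {
      ?affiliation iospress:geocodingOutput ?geocoded .
      ?geocoded iospress-geocode:country ?country
    }
  }
}
```

With the two OPTIONALs written side by side as in the original query, an author whose first OPTIONAL fails leaves ?affiliation unbound, and the second OPTIONAL then matches every affiliation that has geocoding output.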