Re: Re: RDFS subPropertyOf property path query performance

Lorenz Buehmann Tue, 14 May 2024 03:19:43 -0700

Hi Christian,

thanks for sharing a self-contained project.

What happens if you avoid the FILTER IN expression(s), which can be too expensive as the filter is evaluated? You could also use inline data (VALUES) to restrict the evaluation to the given resources:



PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX : <https://stati-cal.com/ontology/al/>
PREFIX aln: <https://stati-cal.com/ontology/al/bnode/>
SELECT (COUNT(1) AS ?cnt)
WHERE
{

  ?cmit a :StaticMethod ;
  :name "Commit" .
  ?decl (:possiblyCommittingProceduralFlow)+ ?cmit ;
         a ?procedureOrTrigger .
  VALUES ?procedureOrTrigger {:Procedure :Trigger}

  VALUES ?ownerType {
    :Table :TableExtension :Page :PageExtension
    :Report :ReportExtension :Codeunit :XmlPort
    :Query :ControlAddIn :Enum :EnumExtension
    :PageCustomization :Profile
    :DotNetPackage :Interface
    :PermissionSet :PermissionSetExtension
    :Entitlement :DotNet
  }
  ?owner  :contains+  ?decl ;
          a ?ownerType .

  ?decl :localKey ?localkey .
}


That should at least be a bit faster if I'm not wrong.
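
A minimal sketch, assuming the application assembles its query text in Java: a small helper that renders a VALUES block like the ones in the rewritten query from a list of prefixed names. The class name `ValuesClause` and its `build` method are hypothetical and not part of Jena. The names are inserted verbatim, so the sketch assumes trusted prefixed names rather than user input.

```java
import java.util.List;

// Hypothetical helper (not a Jena API): renders a SPARQL VALUES block
// for a single variable from a list of prefixed names.
final class ValuesClause {
    private ValuesClause() {}

    static String build(String variable, List<String> prefixedNames) {
        if (prefixedNames.isEmpty()) {
            // An empty VALUES block would yield no solutions, which is
            // almost certainly not what the caller wants here.
            throw new IllegalArgumentException("expected at least one term");
        }
        // Terms are separated by spaces, not commas as in FILTER ... IN (...).
        return "VALUES ?" + variable + " { " + String.join(" ", prefixedNames) + " }";
    }
}
```

For example, `ValuesClause.build("procedureOrTrigger", List.of(":Procedure", ":Trigger"))` produces the text `VALUES ?procedureOrTrigger { :Procedure :Trigger }`, which can replace the corresponding FILTER line.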

You should also provide the TDB database with some statistics about the data. Use tdb2.tdbstats for this to create a stats.opt file, which you put in the Data-0001 directory. This helps the optimizer reorder triple patterns. It won't work for property paths, but in general it's a good idea to give it a try.



Cheers,
Lorenz

On 14.05.24 10:21, Christian Clausen wrote:

Hi Lorenz,

I have shared a Java project which includes data here:
https://drive.google.com/file/d/1MOQXNmTEmJBnzLIgQ3pQViQbiyvTT76q/view?usp=sharing

In GraphServer.java there is a variable USE_RDFS, which you can use to
switch between using RDFS and not.

In preparing the repro I realized that the performance difference only
occurs on more complex queries than what I originally thought.

The test query is this:

PREFIX xsd:<http://www.w3.org/2001/XMLSchema#>
PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX :<https://stati-cal.com/ontology/al/>
PREFIX aln:<https://stati-cal.com/ontology/al/bnode/>

SELECT (COUNT(1) AS ?cnt)
WHERE
   { ?cmit a :StaticMethod ;
           :name "Commit" .
     ?decl (:possiblyCommittingProceduralFlow)+ ?cmit ;
           a ?procedureOrTrigger .
     FILTER(?procedureOrTrigger in (:Procedure, :Trigger))
     ?owner  :contains+  ?decl ;
             a ?ownerType .
     FILTER(?ownerType in (:Table, :TableExtension, :Page, :PageExtension,
                            :Report, :ReportExtension, :Codeunit, :XmlPort,
                            :Query, :ControlAddIn, :Enum, :EnumExtension,
                            :PageCustomization, :Profile,
                            :DotNetPackage, :Interface,
                            :PermissionSet, :PermissionSetExtension,
                            :Entitlement, :DotNet))
     ?decl :localKey ?localkey .
   }

(For simplicity, I have used count instead of selecting ?owner and
?localKey which we use in our application.)

This is how I run the tests:

curl -v -X POST --header "Content-Type: application/sparql-query" \
     --data-binary @test1.sparql http://localhost:3030/CodeGraph/query
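
For repeatable timings, the same request can be sent from Java. The following is a sketch using only the JDK's built-in HTTP client (Java 11 or later). The class `SparqlPost` and its methods are hypothetical names. The endpoint URL and content type are taken from the curl call above.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper: posts a SPARQL query the same way the curl call does.
final class SparqlPost {
    private SparqlPost() {}

    // Builds a POST request with the raw query text as the body.
    static HttpRequest buildRequest(String endpoint, String query) {
        return HttpRequest.newBuilder(URI.create(endpoint))
                .header("Content-Type", "application/sparql-query")
                .POST(HttpRequest.BodyPublishers.ofString(query))
                .build();
    }

    // Sends the query and returns the elapsed wall-clock time in milliseconds.
    static long timeQuery(String endpoint, String query)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        long start = System.nanoTime();
        HttpResponse<String> response =
                client.send(buildRequest(endpoint, query), HttpResponse.BodyHandlers.ofString());
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status " + response.statusCode());
        }
        return elapsedMs;
    }
}
```

A call such as `SparqlPost.timeQuery("http://localhost:3030/CodeGraph/query", Files.readString(Path.of("test1.sparql")))` would then time one run. The measured time includes transferring the result, so it is only a rough stand-in for server-side evaluation time.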

With RDFS enabled, the query runs in about 80 seconds.

With RDFS disabled, it takes about 1-2 seconds.

Interestingly, if I leave out the part that begins with ?owner...:

SELECT (COUNT(1) AS ?cnt)
WHERE
   { ?cmit a :StaticMethod ;
           :name "Commit" .
     ?decl (:possiblyCommittingProceduralFlow)+ ?cmit ;
           a ?procedureOrTrigger .
     FILTER(?procedureOrTrigger in (:Procedure, :Trigger))
   }

Then performance is similar (and good) with and without RDFS.

/Christian


On Mon, 13 May 2024 at 12:04, Lorenz Buehmann <
buehm...@informatik.uni-leipzig.de> wrote:

Hi,

does it mean the ?origin is always bound to a resource in the graph? Can
you share the whole query maybe?

How long are the sequences in the graph? How many paths starting from a
node, i.e. what's the out degree in general per node?

Also, would it be possible to share some kind of data for investigation?

In general, the RDFS inference you're using is pretty lightweight and
runs at query evaluation time. All it does at triple pattern evaluation
time is, in your case, to incorporate the rdfs:subPropertyOf triples from
the schema, but that cost might indeed grow at each step on the path.
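
To make the "grows at each step" point concrete, here is a toy model in plain Java. It is an illustration only and is not how Jena's RDFS dataset is implemented. It also does not reproduce or explain the timings reported in this thread. All names (`SubPropertyPathDemo`, `expand`, `reachable`) and the in-memory graph layout are assumptions made for the sketch. It evaluates `origin property+ ?result` by breadth-first search and consults the schema again at every visited node.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only: schema-aware evaluation of a one-or-more property path.
final class SubPropertyPathDemo {
    private SubPropertyPathDemo() {}

    // Returns the property itself plus all of its (transitive) sub-properties.
    // The schema maps a property to its direct sub-properties.
    static Set<String> expand(String property, Map<String, List<String>> subPropertiesOf) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> todo = new ArrayDeque<>();
        todo.add(property);
        while (!todo.isEmpty()) {
            String p = todo.poll();
            if (seen.add(p)) {
                todo.addAll(subPropertiesOf.getOrDefault(p, List.of()));
            }
        }
        return seen;
    }

    // Nodes reachable from origin via one or more steps of property,
    // where a step may use the property or any of its sub-properties.
    // The graph maps subject -> predicate -> objects.
    static Set<String> reachable(String origin, String property,
                                 Map<String, Map<String, List<String>>> graph,
                                 Map<String, List<String>> subPropertiesOf) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(origin);
        while (!frontier.isEmpty()) {
            String node = frontier.poll();
            // The schema is consulted at every step, and each matching
            // predicate costs one lookup against the data.
            for (String p : expand(property, subPropertiesOf)) {
                for (String next : graph.getOrDefault(node, Map.of())
                                        .getOrDefault(p, List.of())) {
                    if (visited.add(next)) {
                        frontier.add(next);
                    }
                }
            }
        }
        return visited;
    }
}
```

With a schema in which `:flowA` and `:flowB` are sub-properties of `:flow`, `expand(":flow", schema)` yields three predicates, so in this toy every visited node costs three lookups instead of one. The extra work is multiplied by the number of nodes the path touches.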


Cheers,

Lorenz

On 13.05.24 09:41, Christian Clausen wrote:

In our graph we have :flow properties and need to distinguish different
kinds of flows, :flowA and :flowB.

We modelled this in RDFS:

      :flowA rdfs:subPropertyOf :flow
      :flowB rdfs:subPropertyOf :flow

Some of our SPARQL queries use :flow+ and some use :flowA+, always from an
origin:

      ?origin :flowA+ ?result

or

      ?origin :flow+ ?result

If we start Fuseki *without* RDFS, the following queries finish in a second
or two:

      ?origin :flowA+ ?result
      ?origin (:flowA | :flowB)+ ?result

If we start Fuseki *with* RDFS, the following queries take about 85 seconds:

      ?origin :flowA+ ?result
      ?origin :flow+ ?result

What is causing this difference in performance? Are we missing something, or
should we avoid RDFS for optimal performance? Are there any other alternatives?

Our overall process is:

1. Generate TTL files with :flowA and :flowB properties (not :flow other
than implied by rdfs:subPropertyOf)
2. Load with TDB2 loader
3. Start Fuseki (with RDFS vocabulary or not)

Here follows the code we use to start Fuseki.

Without RDFS:

          Dataset data = TDB2Factory.connectDataset(options.directory);

          FusekiServer server = FusekiServer.create()
              .port(options.port)
              .loopback(true)
              // Plain TDB2 dataset, no RDFS wrapper
              .addDataset(options.datasetName, data.asDatasetGraph())
              .addEndpoint(options.datasetName, "query", Operation.Query)
              // shortestPath
              .registerOperation(shortestPathOp, WebContent.contentTypeJSON,
                                 new ShortestPathService())
              .addEndpoint(options.datasetName, "shortestPath", shortestPathOp)
              // diagnostics
              .verbose(true)
              .enablePing(true)
              .enableStats(true)
              .enableMetrics(true)
              .enableTasks(true)
              .build();

          // Start
          server.start();

With RDFS:



          Dataset data = TDB2Factory.connectDataset(options.directory);
          Graph vocabulary = RDFDataMgr.loadGraph(options.vocabularyFileName);
          // Wrap the TDB2 dataset so that the RDFS vocabulary is applied at query time
          DatasetGraph dsg = RDFSFactory.datasetRDFS(data.asDatasetGraph(), vocabulary);

          FusekiServer server = FusekiServer.create()
              .port(options.port)
              .loopback(true)
              // RDFS-wrapped dataset
              .addDataset(options.datasetName, dsg)
              .addEndpoint(options.datasetName, "query", Operation.Query)
              // shortestPath
              .registerOperation(shortestPathOp, WebContent.contentTypeJSON,
                                 new ShortestPathService())
              .addEndpoint(options.datasetName, "shortestPath", shortestPathOp)
              // diagnostics
              .verbose(true)
              .enablePing(true)
              .enableStats(true)
              .enableMetrics(true)
              .enableTasks(true)
              .build();

          // Start
          server.start();

--
Lorenz Bühmann
Research Associate/Scientific Developer

email: buehm...@infai.org

Institute for Applied Informatics e.V. (InfAI) | Goerdelerring 9 | 04109
Leipzig | Germany
