Hi Andy,

Thanks for the pointers.
I have been digging a bit more into the plans for the actual queries against the actual datasets, and I see that a join is used. Here's an example of the query:

```
prefix mfg: <http://example.com/def/mfg/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>

select distinct ?uid
where {
  {
    select ?wafer ?uid (?X - ?ref_X as ?X_) (?Y - ?ref_Y as ?Y_)
    {
      graph ?g_uid {
        [] a mfg:ReferenceDevice ;
          rdfs:label "ref die"^^xsd:string ;
          mfg:location [
            mfg:xPos ?ref_X ;
            mfg:yPos ?ref_Y
          ] .
        [] mfg:object [ rdfs:label ?uid ] ;
          mfg:location [
            mfg:xPos ?X ;
            mfg:yPos ?Y ;
            mfg:containedIn ?wafer
          ]
      }
    }
  }
  {
    select ?wafer (?X - ?ref_X as ?X_) (?Y - ?ref_Y as ?Y_)
    {
      graph ?g_fwt {
        [] a mfg:ReferenceDevice ;
          rdfs:label "Prober Reference"^^xsd:string ;
          mfg:location [
            mfg:xPos ?ref_X ;
            mfg:yPos ?ref_Y
          ] .
        [] mfg:binCode [ mfg:pick true ] ;
          mfg:location [
            mfg:xPos ?X ;
            mfg:yPos ?Y ;
            mfg:containedIn ?wafer
          ]
      }
    }
  }
}
```

The gist of this is that we try to normalize the coordinates from two maps based on a common reference point.
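To make the intent concrete, here is the same normalization sketched in plain Python. This is illustrative only: the positions, payloads, and reference points are made up, not taken from the real dataset.

```python
# Each map reports device positions in its own coordinate frame; subtracting
# that frame's reference point yields coordinates comparable across maps.
def normalize(positions, ref):
    ref_x, ref_y = ref
    return {(x - ref_x, y - ref_y): payload
            for (x, y), payload in positions.items()}

# Hypothetical "uid" layer and "pick" layer, each with its own reference device.
uid_map = normalize({(12, 7): "uid-001", (13, 7): "uid-002"}, ref=(10, 5))
fwt_map = normalize({(3, 3): True}, ref=(1, 1))

# Devices from the two maps now line up on the shared normalized (X_, Y_) key.
common = uid_map.keys() & fwt_map.keys()
print(sorted(common))  # [(2, 2)]
```

After normalization the join key is simply the normalized coordinate pair (plus the wafer, in the real query).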
And the algebra generated from running that query against some example data:

```
(distinct
  (project (?uid)
    (join
      (project (?wafer ?uid ?X_ ?Y_)
        (extend ((?X_ (- ?/X ?/ref_X)) (?Y_ (- ?/Y ?/ref_Y)))
          (graph ?/g_uid
            (bgp
              (triple ?/?0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/def/mfg/ReferenceDevice>)
              (triple ?/?0 <http://www.w3.org/2000/01/rdf-schema#label> "ref die")
              (triple ?/?0 <http://example.com/def/mfg/location> ?/?1)
              (triple ?/?1 <http://example.com/def/mfg/xPos> ?/ref_X)
              (triple ?/?1 <http://example.com/def/mfg/yPos> ?/ref_Y)
              (triple ?/?2 <http://example.com/def/mfg/object> ?/?3)
              (triple ?/?3 <http://www.w3.org/2000/01/rdf-schema#label> ?uid)
              (triple ?/?2 <http://example.com/def/mfg/location> ?/?4)
              (triple ?/?4 <http://example.com/def/mfg/xPos> ?/X)
              (triple ?/?4 <http://example.com/def/mfg/yPos> ?/Y)
              (triple ?/?4 <http://example.com/def/mfg/containedIn> ?wafer)
            ))))
      (project (?wafer ?X_ ?Y_)
        (extend ((?X_ (- ?/X ?/ref_X)) (?Y_ (- ?/Y ?/ref_Y)))
          (graph ?/g_fwt
            (bgp
              (triple ?/?5 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/def/mfg/ReferenceDevice>)
              (triple ?/?5 <http://www.w3.org/2000/01/rdf-schema#label> "Prober Reference")
              (triple ?/?5 <http://example.com/def/mfg/location> ?/?6)
              (triple ?/?6 <http://example.com/def/mfg/xPos> ?/ref_X)
              (triple ?/?6 <http://example.com/def/mfg/yPos> ?/ref_Y)
              (triple ?/?7 <http://example.com/def/mfg/binCode> ?/?8)
              (triple ?/?8 <http://example.com/def/mfg/pick> true)
              (triple ?/?7 <http://example.com/def/mfg/location> ?/?9)
              (triple ?/?9 <http://example.com/def/mfg/xPos> ?/X)
              (triple ?/?9 <http://example.com/def/mfg/yPos> ?/Y)
              (triple ?/?9 <http://example.com/def/mfg/containedIn> ?wafer)
            )))))))
```

The query requires over 30 minutes to complete. The cardinality on each side of the join is around 125k. So large, but not huge. If I run the subqueries in isolation, each completes in a reasonable ~30 seconds.
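For a sense of scale: a nested-loop join over two ~125k-row inputs examines on the order of 1.5 × 10^10 candidate pairs, while a hash join on the shared variables (?wafer, ?X_, ?Y_) stays linear in the combined input size. A small Python sketch of the hash-join idea, on toy data (the rows and sizes are made up, and this is not a description of Jena's internals):

```python
# Contrast of join strategies; rows are (wafer, X_, Y_, payload) tuples.
from collections import defaultdict

left  = [("w1", x, y, f"uid-{x}-{y}") for x in range(300) for y in range(300)]
right = [("w1", x, y, True) for x in range(0, 300, 2) for y in range(300)]

def hash_join(left, right):
    # Build a hash table keyed on the join variables (wafer, X_, Y_),
    # then probe it once per left row: O(n + m) instead of O(n * m).
    index = defaultdict(list)
    for wafer, x, y, payload in right:
        index[(wafer, x, y)].append(payload)
    return [(l, r) for l in left for r in index.get((l[0], l[1], l[2]), [])]

joined = hash_join(left, right)
print(len(joined))  # 45000
```

A naive nested loop over the same 90,000 × 45,000 rows would perform about 4 × 10^9 comparisons; the hash join does roughly 135,000 dictionary operations.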
Serializing the results to CSV and doing the join with Pandas takes a few seconds.

```
from pandas import read_csv

df_1 = read_csv('subquery_1.csv', encoding='utf-8').set_index(['wafer', 'X_', 'Y_'])
df_2 = read_csv('subquery_2.csv', encoding='utf-8').set_index(['wafer', 'X_', 'Y_'])
df_joined = df_1.join(df_2)
print(df_joined)
```

Admittedly I am telling Pandas which fields to index on, but I assume that should be self-evident from the SPARQL query text.

Would it help to make some example RDF data available, so you can see whether Jena can be improved for such use cases, or am I better off just using Pandas for that last join?

Regards,
John

> -----Original Message-----
> From: Andy Seaborne <[email protected]>
> Sent: Monday, 1 April 2024 19:04
> To: [email protected]
> Subject: Re: Performance question with joins
>
> Hi John,
>
> Yes, the join of two large subqueries is the issue.
>
> Optimization involves making pragmatic determinations. Sometimes the
> result isn't optimal for some data.
>
> Something to consider is detecting the independence of the (?X_i,
> ?X_j) and (?Y_i, ?Y_j) blocks, because a hash join is likely a better
> choice. That, or caching partial evaluations where there are
> cross-product-like effects.
>
> Thank you for the details.
>
>     Andy
>
> See also:
>
> "A Worst-Case Optimal Join Algorithm for SPARQL"
> https://aidanhogan.com/docs/SPARQL_worst_case_optimal.pdf
>
> "Leapfrog Triejoin: A Simple, Worst-Case Optimal Join Algorithm"
> https://www.openproceedings.org/2014/conf/icdt/Veldhuizen14.pdf
>
> On 29/03/2024 09:25, John Walker wrote:
> > I did some more experimentation and checked the query algebra using
> > the --explain option.
> >
> > For sake of simplicity I use a simpler query:
> >
> > ```
> > select (count(*) as ?C)
> > where {
> >   {
> >     select ?X ?Y (struuid() as ?UUID)
> >     where {
> >       values ?X { 0 1 }
> >       values ?Y { 0 1 }
> >     }
> >   }
> >   {
> >     select ?X ?Y
> >     where {
> >       {
> >         select ?X ?Y (rand() as ?RAND)
> >         where {
> >           values ?X { 0 1 }
> >           values ?Y { 0 1 }
> >         }
> >       }
> >       filter (?RAND < 0.95)
> >     }
> >   }
> > }
> > ```
> >
> > For this the algebra is:
> >
> > ```
> > (project (?C)
> >   (extend ((?C ?.0))
> >     (group () ((?.0 (count)))
> >       (sequence
> >         (project (?X ?Y ?UUID)
> >           (extend ((?UUID (struuid)))
> >             (sequence
> >               (table (vars ?Y)
> >                 (row [?Y 0])
> >                 (row [?Y 1])
> >               )
> >               (table (vars ?X)
> >                 (row [?X 0])
> >                 (row [?X 1])
> >               ))))
> >         (project (?X ?Y)
> >           (project (?X ?Y ?/RAND)
> >             (filter (< ?/RAND 0.95)
> >               (extend ((?/RAND (rand)))
> >                 (sequence
> >                   (table (vars ?Y)
> >                     (row [?Y 0])
> >                     (row [?Y 1])
> >                   )
> >                   (table (vars ?X)
> >                     (row [?X 0])
> >                     (row [?X 1])
> >                   ))))))))))
> > ```
> >
> > Whilst if I make a small change to also project some other variable
> > from the second subquery:
> >
> > ```
> > select (count(*) as ?C)
> > where {
> >   {
> >     select ?X ?Y (struuid() as ?UUID)
> >     where {
> >       values ?X { 0 1 }
> >       values ?Y { 0 1 }
> >     }
> >   }
> >   {
> >     select ?X ?Y (0 as ?_)
> >     where {
> >       {
> >         select ?X ?Y (rand() as ?RAND)
> >         where {
> >           values ?X { 0 1 }
> >           values ?Y { 0 1 }
> >         }
> >       }
> >       filter (?RAND < 0.95)
> >     }
> >   }
> > }
> > ```
> >
> > Then the algebra is:
> >
> > ```
> > (project (?C)
> >   (extend ((?C ?.0))
> >     (group () ((?.0 (count)))
> >       (join
> >         (project (?X ?Y ?UUID)
> >           (extend ((?UUID (struuid)))
> >             (sequence
> >               (table (vars ?Y)
> >                 (row [?Y 0])
> >                 (row [?Y 1])
> >               )
> >               (table (vars ?X)
> >                 (row [?X 0])
> >                 (row [?X 1])
> >               ))))
> >         (project (?X ?Y ?_)
> >           (extend ((?_ 0))
> >             (project (?X ?Y ?/RAND)
> >               (filter (< ?/RAND 0.95)
> >                 (extend ((?/RAND (rand)))
> >                   (sequence
> >                     (table (vars ?Y)
> >                       (row [?Y 0])
> >                       (row [?Y 1])
> >                     )
> >                     (table (vars ?X)
> >                       (row [?X 0])
> >                       (row [?X 1])
> >                     )))))))))))
> > ```
> >
> > Note the outermost sequence operator has changed to a join operator.
> > I don't understand the logic behind that.
> >
> > Note that projecting the ?RAND variable from the second query does
> > not force the join.
> >
> > John
> >
> >> -----Original Message-----
> >> From: John Walker <[email protected]>
> >> Sent: Friday, 29 March 2024 08:55
> >> To: [email protected]
> >> Subject: RE: Performance question with joins
> >>
> >> I did a bit more experimentation by putting the second subquery
> >> inside some other clauses:
> >>
> >> * FILTER EXISTS - no effect
> >> * OPTIONAL - runtime around 0.5s
> >> * MINUS - runtime around 0.5s
> >>
> >> So, I assume that the engine is doing some form of nested loop join
> >> to iterate through each solution from the first subquery and test
> >> against the second. Same as what is happening with FILTER EXISTS.
> >>
> >> A "hack" to get around this seems to be to add a redundant MINUS {}
> >> between the subqueries.
> >>
> >> John
> >>
> >>> -----Original Message-----
> >>> From: John Walker <[email protected]>
> >>> Sent: Friday, 29 March 2024 07:58
> >>> To: jena-users-ml <[email protected]>
> >>> Subject: Performance question with joins
> >>>
> >>> Hi,
> >>>
> >>> I am working with some data representing a 2D Cartesian coordinate
> >>> system of simple grid array "maps".
> >>> The X and Y coordinates are represented as integers.
> >>>
> >>> I want to join data from different "layers" in the data.
> >>> One layer contains a unique identifier for each position.
> >>> The other layer contains only a subset of the coordinates.
> >>>
> >>> I have written the following queries to simulate some data to
> >>> demonstrate the issue I face.
> >>>
> >>> This query generates a solution field with the XY coordinates and
> >>> an identifier for each:
> >>>
> >>> ```
> >>> select ?X ?Y (struuid() as ?UUID)
> >>> where {
> >>>   values ?X_i { 0 1 2 3 4 5 6 7 8 9 }
> >>>   values ?X_j { 0 1 2 3 4 5 6 7 8 9 }
> >>>   bind ( ?X_i + 10 * ?X_j as ?X)
> >>>   values ?Y_i { 0 1 2 3 4 5 6 7 8 9 }
> >>>   values ?Y_j { 0 1 2 3 4 5 6 7 8 9 }
> >>>   bind ( ?Y_i + 10 * ?Y_j as ?Y)
> >>> }
> >>> ```
> >>>
> >>> This query generates a filtered subset of the XY coordinates:
> >>>
> >>> ```
> >>> select ?X ?Y
> >>> where {
> >>>   {
> >>>     select ?X ?Y (rand() as ?RAND)
> >>>     where {
> >>>       values ?X_i { 0 1 2 3 4 5 6 7 8 9 }
> >>>       values ?X_j { 0 1 2 3 4 5 6 7 8 9 }
> >>>       bind ( ?X_i + 10 * ?X_j as ?X)
> >>>       values ?Y_i { 0 1 2 3 4 5 6 7 8 9 }
> >>>       values ?Y_j { 0 1 2 3 4 5 6 7 8 9 }
> >>>       bind ( ?Y_i + 10 * ?Y_j as ?Y)
> >>>     }
> >>>   }
> >>>   filter (?RAND < 0.95)
> >>> }
> >>> ```
> >>>
> >>> Now if I run each of those and write the results out as CSV using
> >>> the sparql command line tool, each completes in under half a second.
> >>>
> >>> I can then join the results using Pandas in Python, which takes
> >>> less than a second.
> >>>
> >>> However, if I try to do the join within the SPARQL process using
> >>> the following query with subqueries, it takes around 10 seconds:
> >>>
> >>> ```
> >>> select (count(*) as ?C)
> >>> where {
> >>>   {
> >>>     select ?X ?Y (struuid() as ?UUID)
> >>>     where {
> >>>       values ?X_i { 0 1 2 3 4 5 6 7 8 9 }
> >>>       values ?X_j { 0 1 2 3 4 5 6 7 8 9 }
> >>>       bind ( ?X_i + 10 * ?X_j as ?X)
> >>>       values ?Y_i { 0 1 2 3 4 5 6 7 8 9 }
> >>>       values ?Y_j { 0 1 2 3 4 5 6 7 8 9 }
> >>>       bind ( ?Y_i + 10 * ?Y_j as ?Y)
> >>>     }
> >>>   }
> >>>   {
> >>>     select ?X ?Y
> >>>     where {
> >>>       {
> >>>         select ?X ?Y (rand() as ?RAND)
> >>>         where {
> >>>           values ?X_i { 0 1 2 3 4 5 6 7 8 9 }
> >>>           values ?X_j { 0 1 2 3 4 5 6 7 8 9 }
> >>>           bind ( ?X_i + 10 * ?X_j as ?X)
> >>>           values ?Y_i { 0 1 2 3 4 5 6 7 8 9 }
> >>>           values ?Y_j { 0 1 2 3 4 5 6 7 8 9 }
> >>>           bind ( ?Y_i + 10 * ?Y_j as ?Y)
> >>>         }
> >>>       }
> >>>       filter (?RAND < 0.95)
> >>>     }
> >>>   }
> >>> }
> >>> ```
> >>>
> >>> In practice the "maps" I need to process contain hundreds of
> >>> thousands of coordinates, and then the final join using Jena
> >>> becomes intractable in terms of memory and time.
> >>> Whilst if I run each subquery separately and use Pandas for the
> >>> join, it completes without breaking a sweat.
> >>>
> >>> I thought SPARQL engines were optimized for joins, so how come
> >>> Jena gets absolutely spanked by Pandas here?
> >>>
> >>> Am I missing something when I write the queries?
> >>>
> >>> Regards,
> >>>
> >>> John Walker
> >>> Principal Consultant & co-founder
> >>>
> >>> Semaku B.V. | Torenallee 20 (SFJ 3D) | 5617 BC Eindhoven |
> >>> T +31 6 42590072 | https://semaku.com/
> >>> KvK: 58031405 | BTW: NL852842156B01 | IBAN: NL94 INGB 0008 3219 95
