Hi John, I'm keen to have a go rewriting the query if you're able to make some data available / confirm the expected result.
Thanks On Tue, Apr 2, 2024 at 5:18 AM John Walker <[email protected]> wrote: > Hi Andy, > > Thanks for the pointers. > > I have been digging a bit more into the plans for the actual queries > against the actual datasets and see that a join is used. > > Here's an example of the query: > > ``` > prefix mfg: <http://example.com/def/mfg/> > prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> > prefix xsd: <http://www.w3.org/2001/XMLSchema#> > > select distinct ?uid > where { > { select ?wafer ?uid (?X - ?ref_X as ?X_) (?Y - ?ref_Y as ?Y_) { > graph ?g_uid { > [] a mfg:ReferenceDevice ; > rdfs:label "ref die"^^xsd:string ; > mfg:location [ mfg:xPos ?ref_X ; mfg:yPos ?ref_Y ] . > > [] mfg:object [ rdfs:label ?uid ] ; > mfg:location [ mfg:xPos ?X ; mfg:yPos ?Y ; mfg:containedIn ?wafer ] > } > }} > { select ?wafer (?X - ?ref_X as ?X_) (?Y - ?ref_Y as ?Y_) { > graph ?g_fwt { > [] a mfg:ReferenceDevice ; > rdfs:label "Prober Reference"^^xsd:string ; > mfg:location [ mfg:xPos ?ref_X ; mfg:yPos ?ref_Y ] . > > [] mfg:binCode [ mfg:pick true ] ; > mfg:location [ mfg:xPos ?X ; mfg:yPos ?Y ; mfg:containedIn ?wafer ] > } > }} > } > ``` > > The gist of this is that we try to normalize the coordinates from two maps > based on a common reference point. > > And the algebra generated from running that query against some example > data: > > ``` > (distinct > (project (?uid) > (join > (project (?wafer ?uid ?X_ ?Y_) > (extend ((?X_ (- ?/X ?/ref_X)) (?Y_ (- ?/Y ?/ref_Y))) > (graph ?/g_uid > (bgp > (triple ?/?0 < > http://www.w3.org/1999/02/22-rdf-syntax-ns#type> < > http://example.com/def/mfg/ReferenceDevice>) > (triple ?/?0 <http://www.w3.org/2000/01/rdf-schema#label> > "ref die") > (triple ?/?0 <http://example.com/def/mfg/location> ?/?1) > (triple ?/?1 <http://example.com/def/mfg/xPos> ?/ref_X) > (triple ?/?1 <http://example.com/def/mfg/yPos> ?/ref_Y) > (triple ?/?2 <http://example.com/def/mfg/object> ?/?3) > (triple ?/?3 <http://www.w3.org/2000/01/rdf-schema#label> > ?uid) > (triple ?/?2 <http://example.com/def/mfg/location> ?/?4) > (triple ?/?4 <http://example.com/def/mfg/xPos> ?/X) > (triple ?/?4 <http://example.com/def/mfg/yPos> ?/Y) > (triple ?/?4 <http://example.com/def/mfg/containedIn> > ?wafer) > )))) > (project (?wafer ?X_ ?Y_) > (extend ((?X_ (- ?/X ?/ref_X)) (?Y_ (- ?/Y ?/ref_Y))) > (graph ?/g_fwt > (bgp > (triple ?/?5 < > http://www.w3.org/1999/02/22-rdf-syntax-ns#type> < > http://example.com/def/mfg/ReferenceDevice>) > (triple ?/?5 <http://www.w3.org/2000/01/rdf-schema#label> > "Prober Reference") > (triple ?/?5 <http://example.com/def/mfg/location> ?/?6) > (triple ?/?6 <http://example.com/def/mfg/xPos> ?/ref_X) > (triple ?/?6 <http://example.com/def/mfg/yPos> ?/ref_Y) > (triple ?/?7 <http://example.com/def/mfg/binCode> ?/?8) > (triple ?/?8 <http://example.com/def/mfg/pick> true) > (triple ?/?7 <http://example.com/def/mfg/location> ?/?9) > (triple ?/?9 <http://example.com/def/mfg/xPos> ?/X) > (triple ?/?9 <http://example.com/def/mfg/yPos> ?/Y) > (triple ?/?9 <http://example.com/def/mfg/containedIn> > ?wafer) > ))))))) > ``` > > The query requires over 30 minutes to complete. > > The cardinality on the left and right side of the join is around 125k. > So large, but not huge. > > If I run the subqueries in isolation, they complete in a reasonable time > of ~30 seconds. > Serializing the results to CSV and doing the join with Pandas takes a few > seconds. > > ``` > from pandas import read_csv > > df_1 = read_csv('subquery_1.csv', encoding = > 'utf-8').set_index(['wafer','X_', 'Y_']) > df_2 = read_csv('subquery_2.csv', encoding = > 'utf-8').set_index(['wafer','X_', 'Y_']) > > df_joined = df_1.join(df_2) > > print(df_joined) > ``` > > Admittedly I am telling Pandas which fields to index, but I assume that > should be self-evident from the SPARQL query text. > > Does it help to make some example RDF data available to help see if Jena > can be improved for such use cases, or am I better off just using Panda for > that last join? > > Regards, > John > > > -----Original Message----- > > From: Andy Seaborne <[email protected]> > > Sent: Monday, 1 April 2024 19:04 > > To: [email protected] > > Subject: Re: Performance question with joins > > > > Hi John, > > > > Yes, the join of two large subqueries is the issue. > > > > Optimization involves making pragmatic determination. Sometimes it isn't > > optimal for some data. > > > > Something to consider is detecting these independency of the (?X_i, > > ?X_j) and (?Y_i, ?Y_j) blocks because ia hash join is likely a better > choice. That, > > or caching part evaluations there is cross-product like effects. > > > > Thank you for the details. > > > > Andy > > > > See also: > > > > "A Worst-Case Optimal Join Algorithm for SPARQL" > > https://aidanhogan.com/docs/SPARQL_worst_case_optimal.pdf > > > > "Leapfrog Triejoin: A Simple, Worst-Case Optimal Join" > > Algorithm > > https://www.openproceedings.org/2014/conf/icdt/Veldhuizen14.pdf > > > > On 29/03/2024 09:25, John Walker wrote: > > > I did some more experimentation and checked the query algebra using > the -- > > explain option. > > > > > > For sake of simplicity I use a simpler query: > > > > > > ``` > > > select (count(*) as ?C) > > > where { > > > { > > > select ?X ?Y (struuid() as ?UUID) > > > where { > > > values ?X { 0 1 } > > > values ?Y { 0 1 } > > > } > > > } > > > { > > > select ?X ?Y > > > where { > > > { > > > select ?X ?Y (rand() as ?RAND) > > > where { > > > values ?X { 0 1 } > > > values ?Y { 0 1 } > > > } > > > } > > > filter (?RAND < 0.95) > > > } > > > } > > > } > > > ``` > > > > > > For this the algebra is: > > > > > > ``` > > > (project (?C) > > > (extend ((?C ?.0)) > > > (group () ((?.0 (count))) > > > (sequence > > > (project (?X ?Y ?UUID) > > > (extend ((?UUID (struuid))) > > > (sequence > > > (table (vars ?Y) > > > (row [?Y 0]) > > > (row [?Y 1]) > > > ) > > > (table (vars ?X) > > > (row [?X 0]) > > > (row [?X 1]) > > > )))) > > > (project (?X ?Y) > > > (project (?X ?Y ?/RAND) > > > (filter (< ?/RAND 0.95) > > > (extend ((?/RAND (rand))) > > > (sequence > > > (table (vars ?Y) > > > (row [?Y 0]) > > > (row [?Y 1]) > > > ) > > > (table (vars ?X) > > > (row [?X 0]) > > > (row [?X 1]) > > > )))))))))) > > > ``` > > > > > > Whilst if I make a small change to also project some other variable > > > from the second subquery > > > > > > ``` > > > select (count(*) as ?C) > > > where { > > > { > > > select ?X ?Y (struuid() as ?UUID) > > > where { > > > values ?X { 0 1 } > > > values ?Y { 0 1 } > > > } > > > } > > > { > > > select ?X ?Y (0 as ?_) > > > where { > > > { > > > select ?X ?Y (rand() as ?RAND) > > > where { > > > values ?X { 0 1 } > > > values ?Y { 0 1 } > > > } > > > } > > > filter (?RAND < 0.95) > > > } > > > } > > > } > > > ``` > > > > > > Then the algebra is: > > > > > > ``` > > > (project (?C) > > > (extend ((?C ?.0)) > > > (group () ((?.0 (count))) > > > (join > > > (project (?X ?Y ?UUID) > > > (extend ((?UUID (struuid))) > > > (sequence > > > (table (vars ?Y) > > > (row [?Y 0]) > > > (row [?Y 1]) > > > ) > > > (table (vars ?X) > > > (row [?X 0]) > > > (row [?X 1]) > > > )))) > > > (project (?X ?Y ?_) > > > (extend ((?_ 0)) > > > (project (?X ?Y ?/RAND) > > > (filter (< ?/RAND 0.95) > > > (extend ((?/RAND (rand))) > > > (sequence > > > (table (vars ?Y) > > > (row [?Y 0]) > > > (row [?Y 1]) > > > ) > > > (table (vars ?X) > > > (row [?X 0]) > > > (row [?X 1]) > > > ))))))))))) > > > ``` > > > > > > Note the outermost sequence operator has changed to a join operator. > > > I don’t understand the logic behind that. > > > > > > Note that projecting the ?RAND variable from the second query does not > > force the join. > > > > > > John > > > > > >> -----Original Message----- > > >> From: John Walker <[email protected]> > > >> Sent: Friday, 29 March 2024 08:55 > > >> To: [email protected] > > >> Subject: RE: Performance question with joins > > >> > > >> I did a bit more experimentation by putting the second subquery > > >> inside some other clauses: > > >> > > >> * FILTER EXISTS - no effect > > >> * OPTIONAL - runtime around 0.5s > > >> * MINUS - runtime around 0.5s > > >> > > >> So, I assume that the engine is doing some form of nested loop join > > >> to iterate through each solution from the first subquery and test > against > > the second. > > >> Same as what is happening with FILTER EXISTS. > > >> > > >> A "hack" to get around this seems to be to add a redundant MINUS {} > > >> between the subqueries. > > >> > > >> John > > >> > > >>> -----Original Message----- > > >>> From: John Walker <[email protected]> > > >>> Sent: Friday, 29 March 2024 07:58 > > >>> To: jena-users-ml <[email protected]> > > >>> Subject: Performance question with joins > > >>> > > >>> Hi, > > >>> > > >>> I am working with some data representing a 2D Cartesian coordinate > > >>> system representing simple grid array “maps” > > >>> The X and Y coordinates are represented as integers. > > >>> > > >>> I want to join data from different “layers” in the data. > > >>> One layer contains a unique identifier for each position. > > >>> The other layer only contains a subset of coordinates. > > >>> > > >>> I have written the following queries to simulate some data to > > >>> demonstrate the issue I face. > > >>> > > >>> This query generates a solution field with the XY coordinates and an > > >>> identifier for each: > > >>> > > >>> ``` > > >>> select ?X ?Y (struuid() as ?UUID) > > >>> where { > > >>> values ?X_i { 0 1 2 3 4 5 6 7 8 9 } > > >>> values ?X_j { 0 1 2 3 4 5 6 7 8 9 } > > >>> bind ( ?X_i + 10 * ?X_j as ?X) > > >>> values ?Y_i { 0 1 2 3 4 5 6 7 8 9 } > > >>> values ?Y_j { 0 1 2 3 4 5 6 7 8 9 } > > >>> bind ( ?Y_i + 10 * ?Y_j as ?Y) > > >>> } > > >>> ``` > > >>> > > >>> This query generates a filtered subset of the XY coordinates: > > >>> > > >>> ``` > > >>> select ?X ?Y > > >>> where { > > >>> { > > >>> select ?X ?Y (rand() as ?RAND) > > >>> where { > > >>> values ?X_i { 0 1 2 3 4 5 6 7 8 9 } > > >>> values ?X_j { 0 1 2 3 4 5 6 7 8 9 } > > >>> bind ( ?X_i + 10 * ?X_j as ?X) > > >>> values ?Y_i { 0 1 2 3 4 5 6 7 8 9 } > > >>> values ?Y_j { 0 1 2 3 4 5 6 7 8 9 } > > >>> bind ( ?Y_i + 10 * ?Y_j as ?Y) > > >>> } > > >>> } > > >>> filter (?RAND < 0.95) > > >>> } > > >>> ``` > > >>> > > >>> Now if I run each of those and write the results out as CSV using > > >>> the sparql command line tool, each completes in under half a second. > > >>> > > >>> I can then join the results using Pandas in Python, which takes less > > >>> than a second. > > >>> > > >>> However, if I try to do the join within the SPARQL process usings > > >>> the following query with subqueries, it takes around 10 seconds: > > >>> > > >>> ``` > > >>> select (count(*) as ?C) > > >>> where { > > >>> { > > >>> select ?X ?Y (struuid() as ?UUID) > > >>> where { > > >>> values ?X_i { 0 1 2 3 4 5 6 7 8 9 } > > >>> values ?X_j { 0 1 2 3 4 5 6 7 8 9 } > > >>> bind ( ?X_i + 10 * ?X_j as ?X) > > >>> values ?Y_i { 0 1 2 3 4 5 6 7 8 9 } > > >>> values ?Y_j { 0 1 2 3 4 5 6 7 8 9 } > > >>> bind ( ?Y_i + 10 * ?Y_j as ?Y) > > >>> } > > >>> } > > >>> { > > >>> select ?X ?Y > > >>> where { > > >>> { > > >>> select ?X ?Y (rand() as ?RAND) > > >>> where { > > >>> values ?X_i { 0 1 2 3 4 5 6 7 8 9 } > > >>> values ?X_j { 0 1 2 3 4 5 6 7 8 9 } > > >>> bind ( ?X_i + 10 * ?X_j as ?X) > > >>> values ?Y_i { 0 1 2 3 4 5 6 7 8 9 } > > >>> values ?Y_j { 0 1 2 3 4 5 6 7 8 9 } > > >>> bind ( ?Y_i + 10 * ?Y_j as ?Y) > > >>> } > > >>> } > > >>> filter (?RAND < 0.95) > > >>> } > > >>> } > > >>> } > > >>> ``` > > >>> > > >>> In practice the “maps” I need to process contain hundreds of > > >>> thousands of coordinates and then the final join using Jena becomes > > >>> intractable in terms of memory and time. > > >>> Whilst if I run each subquery separately and use Pandas for the > > >>> join, it completes without breaking a sweat. > > >>> > > >>> I thought SPARQL engines were optimized for joins, so how come Jena > > >>> gets absolutely spanked by Pandas here? > > >>> > > >>> Am I missing something when I write the queries? > > >>> > > >>> Regards, > > >>> > > >>> John Walker > > >>> Principal Consultant & co-founder > > >>> > > >>> Semaku B.V. | Torenallee 20 (SFJ 3D) | 5617 BC Eindhoven | T +31 6 > > >>> 42590072 | https://semaku.com/ > > >>> KvK: 58031405 | BTW: NL852842156B01 | IBAN: NL94 INGB 0008 3219 > > 95 >
