Hi,
According to the hash join implementation in Jena, in class AbstractIterHashJoin, a join key is created via the line
joinKey = JoinKey.createVarKey(varsLeft, varsRight) ;
That method takes only the first common variable of both bindings as the join key instead of all matching variables. In our case that would probably be ?wafer, I guess?
Is that a trade-off between hashing costs and join performance, @Andy?
The join key (wafer) sounds like the least selective one for joining the data of both subqueries, compared to (x) or (y), or even better (x, y).
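Just to check that assumption, a per-wafer count on, say, the ?g_uid side would show how many rows share each ?wafer value. A sketch, reusing the prefixes and properties from John's query:
```
prefix mfg: <http://example.com/def/mfg/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>

select ?wafer (count(*) as ?rows)
where {
  graph ?g_uid {
    [] mfg:object [ rdfs:label ?uid ] ;
       mfg:location [ mfg:xPos ?X ; mfg:yPos ?Y ; mfg:containedIn ?wafer ]
  }
}
group by ?wafer
order by desc(?rows)
```
If each wafer covers thousands of positions, a hash join keyed on ?wafer alone would degenerate towards comparing all pairs within a wafer.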
What happens if you try to generate an artificial key by concatenating the variables into a single one? Probably doesn't change anything, but who knows?
```
prefix mfg: <http://example.com/def/mfg/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
select distinct ?uid
where {
{ select (CONCAT( STR(?wafer), STR(?X - ?ref_X), STR(?Y - ?ref_Y) ) AS ?key) ?uid {
graph ?g_uid {
[] a mfg:ReferenceDevice ;
rdfs:label "ref die"^^xsd:string ;
mfg:location [ mfg:xPos ?ref_X ; mfg:yPos ?ref_Y ] .
[] mfg:object [ rdfs:label ?uid ] ;
mfg:location [ mfg:xPos ?X ; mfg:yPos ?Y ; mfg:containedIn ?wafer ]
}
}}
{ select (CONCAT( STR(?wafer), STR(?X - ?ref_X), STR(?Y - ?ref_Y) ) AS ?key) {
graph ?g_fwt {
[] a mfg:ReferenceDevice ;
rdfs:label "Prober Reference"^^xsd:string ;
mfg:location [ mfg:xPos ?ref_X ; mfg:yPos ?ref_Y ] .
[] mfg:binCode [ mfg:pick true ] ;
mfg:location [ mfg:xPos ?X ; mfg:yPos ?Y ; mfg:containedIn ?wafer ]
}
}}
}
```
Cheers,
Lorenz
On 01.04.24 21:18, John Walker wrote:
Hi Andy,
Thanks for the pointers.
I have been digging a bit more into the plans for the actual queries against
the actual datasets and see that a join is used.
Here's an example of the query:
```
prefix mfg: <http://example.com/def/mfg/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
select distinct ?uid
where {
{ select ?wafer ?uid (?X - ?ref_X as ?X_) (?Y - ?ref_Y as ?Y_) {
graph ?g_uid {
[] a mfg:ReferenceDevice ;
rdfs:label "ref die"^^xsd:string ;
mfg:location [ mfg:xPos ?ref_X ; mfg:yPos ?ref_Y ] .
[] mfg:object [ rdfs:label ?uid ] ;
mfg:location [ mfg:xPos ?X ; mfg:yPos ?Y ; mfg:containedIn ?wafer ]
}
}}
{ select ?wafer (?X - ?ref_X as ?X_) (?Y - ?ref_Y as ?Y_) {
graph ?g_fwt {
[] a mfg:ReferenceDevice ;
rdfs:label "Prober Reference"^^xsd:string ;
mfg:location [ mfg:xPos ?ref_X ; mfg:yPos ?ref_Y ] .
[] mfg:binCode [ mfg:pick true ] ;
mfg:location [ mfg:xPos ?X ; mfg:yPos ?Y ; mfg:containedIn ?wafer ]
}
}}
}
```
The gist of this is that we try to normalize the coordinates from two maps
based on a common reference point.
And the algebra generated from running that query against some example data:
```
(distinct
(project (?uid)
(join
(project (?wafer ?uid ?X_ ?Y_)
(extend ((?X_ (- ?/X ?/ref_X)) (?Y_ (- ?/Y ?/ref_Y)))
(graph ?/g_uid
(bgp
(triple ?/?0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/def/mfg/ReferenceDevice>)
(triple ?/?0 <http://www.w3.org/2000/01/rdf-schema#label> "ref die")
(triple ?/?0 <http://example.com/def/mfg/location> ?/?1)
(triple ?/?1 <http://example.com/def/mfg/xPos> ?/ref_X)
(triple ?/?1 <http://example.com/def/mfg/yPos> ?/ref_Y)
(triple ?/?2 <http://example.com/def/mfg/object> ?/?3)
(triple ?/?3 <http://www.w3.org/2000/01/rdf-schema#label> ?uid)
(triple ?/?2 <http://example.com/def/mfg/location> ?/?4)
(triple ?/?4 <http://example.com/def/mfg/xPos> ?/X)
(triple ?/?4 <http://example.com/def/mfg/yPos> ?/Y)
(triple ?/?4 <http://example.com/def/mfg/containedIn> ?wafer)
))))
(project (?wafer ?X_ ?Y_)
(extend ((?X_ (- ?/X ?/ref_X)) (?Y_ (- ?/Y ?/ref_Y)))
(graph ?/g_fwt
(bgp
(triple ?/?5 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/def/mfg/ReferenceDevice>)
(triple ?/?5 <http://www.w3.org/2000/01/rdf-schema#label> "Prober Reference")
(triple ?/?5 <http://example.com/def/mfg/location> ?/?6)
(triple ?/?6 <http://example.com/def/mfg/xPos> ?/ref_X)
(triple ?/?6 <http://example.com/def/mfg/yPos> ?/ref_Y)
(triple ?/?7 <http://example.com/def/mfg/binCode> ?/?8)
(triple ?/?8 <http://example.com/def/mfg/pick> true)
(triple ?/?7 <http://example.com/def/mfg/location> ?/?9)
(triple ?/?9 <http://example.com/def/mfg/xPos> ?/X)
(triple ?/?9 <http://example.com/def/mfg/yPos> ?/Y)
(triple ?/?9 <http://example.com/def/mfg/containedIn> ?wafer)
)))))))
```
The query requires over 30 minutes to complete.
The cardinality on the left and right side of the join is around 125k.
So large, but not huge.
If I run the subqueries in isolation, they complete in a reasonable time of ~30
seconds.
Serializing the results to CSV and doing the join with Pandas takes a few
seconds.
```
from pandas import read_csv

# Load both subquery result sets and index them on the join columns
df_1 = read_csv('subquery_1.csv', encoding='utf-8').set_index(['wafer', 'X_', 'Y_'])
df_2 = read_csv('subquery_2.csv', encoding='utf-8').set_index(['wafer', 'X_', 'Y_'])

# Index-based join on (wafer, X_, Y_)
df_joined = df_1.join(df_2)
print(df_joined)
```
Admittedly I am telling Pandas which fields to index, but I assume that should
be self-evident from the SPARQL query text.
Would it help to make some example RDF data available, to see if Jena can be improved for such use cases, or am I better off just using Pandas for that last join?
Regards,
John
-----Original Message-----
From: Andy Seaborne <[email protected]>
Sent: Monday, 1 April 2024 19:04
To: [email protected]
Subject: Re: Performance question with joins
Hi John,
Yes, the join of two large subqueries is the issue.
Optimization involves making pragmatic determinations. Sometimes the result isn't optimal for some data.
Something to consider is detecting the independence of the (?X_i, ?X_j) and (?Y_i, ?Y_j) blocks, because a hash join is likely a better choice there. That, or caching partial evaluations where there are cross-product-like effects.
Thank you for the details.
Andy
See also:
"A Worst-Case Optimal Join Algorithm for SPARQL"
https://aidanhogan.com/docs/SPARQL_worst_case_optimal.pdf
"Leapfrog Triejoin: A Simple, Worst-Case Optimal Join"
Algorithm
https://www.openproceedings.org/2014/conf/icdt/Veldhuizen14.pdf
On 29/03/2024 09:25, John Walker wrote:
I did some more experimentation and checked the query algebra using the --explain option.
For the sake of simplicity I use a simpler query:
```
select (count(*) as ?C)
where {
{
select ?X ?Y (struuid() as ?UUID)
where {
values ?X { 0 1 }
values ?Y { 0 1 }
}
}
{
select ?X ?Y
where {
{
select ?X ?Y (rand() as ?RAND)
where {
values ?X { 0 1 }
values ?Y { 0 1 }
}
}
filter (?RAND < 0.95)
}
}
}
```
For this the algebra is:
```
(project (?C)
(extend ((?C ?.0))
(group () ((?.0 (count)))
(sequence
(project (?X ?Y ?UUID)
(extend ((?UUID (struuid)))
(sequence
(table (vars ?Y)
(row [?Y 0])
(row [?Y 1])
)
(table (vars ?X)
(row [?X 0])
(row [?X 1])
))))
(project (?X ?Y)
(project (?X ?Y ?/RAND)
(filter (< ?/RAND 0.95)
(extend ((?/RAND (rand)))
(sequence
(table (vars ?Y)
(row [?Y 0])
(row [?Y 1])
)
(table (vars ?X)
(row [?X 0])
(row [?X 1])
))))))))))
```
Whilst if I make a small change to also project some other variable from the second subquery:
```
select (count(*) as ?C)
where {
{
select ?X ?Y (struuid() as ?UUID)
where {
values ?X { 0 1 }
values ?Y { 0 1 }
}
}
{
select ?X ?Y (0 as ?_)
where {
{
select ?X ?Y (rand() as ?RAND)
where {
values ?X { 0 1 }
values ?Y { 0 1 }
}
}
filter (?RAND < 0.95)
}
}
}
```
Then the algebra is:
```
(project (?C)
(extend ((?C ?.0))
(group () ((?.0 (count)))
(join
(project (?X ?Y ?UUID)
(extend ((?UUID (struuid)))
(sequence
(table (vars ?Y)
(row [?Y 0])
(row [?Y 1])
)
(table (vars ?X)
(row [?X 0])
(row [?X 1])
))))
(project (?X ?Y ?_)
(extend ((?_ 0))
(project (?X ?Y ?/RAND)
(filter (< ?/RAND 0.95)
(extend ((?/RAND (rand)))
(sequence
(table (vars ?Y)
(row [?Y 0])
(row [?Y 1])
)
(table (vars ?X)
(row [?X 0])
(row [?X 1])
)))))))))))
```
Note the outermost sequence operator has changed to a join operator.
I don’t understand the logic behind that.
Note that projecting the ?RAND variable from the second query does not
force the join.
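For reference, that ?RAND-projecting variant (which still does not force the join) is:
```
select (count(*) as ?C)
where {
  {
    select ?X ?Y (struuid() as ?UUID)
    where {
      values ?X { 0 1 }
      values ?Y { 0 1 }
    }
  }
  {
    select ?X ?Y ?RAND
    where {
      {
        select ?X ?Y (rand() as ?RAND)
        where {
          values ?X { 0 1 }
          values ?Y { 0 1 }
        }
      }
      filter (?RAND < 0.95)
    }
  }
}
```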
John
-----Original Message-----
From: John Walker <[email protected]>
Sent: Friday, 29 March 2024 08:55
To: [email protected]
Subject: RE: Performance question with joins
I did a bit more experimentation by putting the second subquery
inside some other clauses:
* FILTER EXISTS - no effect
* OPTIONAL - runtime around 0.5s
* MINUS - runtime around 0.5s
So, I assume the engine is doing some form of nested loop join, iterating through each solution from the first subquery and testing it against the second. The same as what is happening with FILTER EXISTS.
A "hack" to get around this seems to be to add a redundant MINUS {}
between the subqueries.
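Spelled out against the original test query (quoted below in my first mail), that workaround looks like this:
```
select (count(*) as ?C)
where {
  {
    select ?X ?Y (struuid() as ?UUID)
    where {
      values ?X_i { 0 1 2 3 4 5 6 7 8 9 }
      values ?X_j { 0 1 2 3 4 5 6 7 8 9 }
      bind ( ?X_i + 10 * ?X_j as ?X)
      values ?Y_i { 0 1 2 3 4 5 6 7 8 9 }
      values ?Y_j { 0 1 2 3 4 5 6 7 8 9 }
      bind ( ?Y_i + 10 * ?Y_j as ?Y)
    }
  }
  # redundant empty MINUS between the two subqueries
  minus {}
  {
    select ?X ?Y
    where {
      {
        select ?X ?Y (rand() as ?RAND)
        where {
          values ?X_i { 0 1 2 3 4 5 6 7 8 9 }
          values ?X_j { 0 1 2 3 4 5 6 7 8 9 }
          bind ( ?X_i + 10 * ?X_j as ?X)
          values ?Y_i { 0 1 2 3 4 5 6 7 8 9 }
          values ?Y_j { 0 1 2 3 4 5 6 7 8 9 }
          bind ( ?Y_i + 10 * ?Y_j as ?Y)
        }
      }
      filter (?RAND < 0.95)
    }
  }
}
```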
John
-----Original Message-----
From: John Walker <[email protected]>
Sent: Friday, 29 March 2024 07:58
To: jena-users-ml <[email protected]>
Subject: Performance question with joins
Hi,
I am working with some data representing a 2D Cartesian coordinate system that models simple grid array “maps”.
The X and Y coordinates are represented as integers.
I want to join data from different “layers” in the data.
One layer contains a unique identifier for each position.
The other layer only contains a subset of coordinates.
I have written the following queries to simulate some data to demonstrate the issue I face.
This first query generates the full field of XY coordinates, with an identifier for each position:
```
select ?X ?Y (struuid() as ?UUID)
where {
values ?X_i { 0 1 2 3 4 5 6 7 8 9 }
values ?X_j { 0 1 2 3 4 5 6 7 8 9 }
bind ( ?X_i + 10 * ?X_j as ?X)
values ?Y_i { 0 1 2 3 4 5 6 7 8 9 }
values ?Y_j { 0 1 2 3 4 5 6 7 8 9 }
bind ( ?Y_i + 10 * ?Y_j as ?Y)
}
```
This query generates a filtered subset of the XY coordinates:
```
select ?X ?Y
where {
{
select ?X ?Y (rand() as ?RAND)
where {
values ?X_i { 0 1 2 3 4 5 6 7 8 9 }
values ?X_j { 0 1 2 3 4 5 6 7 8 9 }
bind ( ?X_i + 10 * ?X_j as ?X)
values ?Y_i { 0 1 2 3 4 5 6 7 8 9 }
values ?Y_j { 0 1 2 3 4 5 6 7 8 9 }
bind ( ?Y_i + 10 * ?Y_j as ?Y)
}
}
filter (?RAND < 0.95)
}
```
Now if I run each of those and write the results out as CSV using
the sparql command line tool, each completes in under half a second.
I can then join the results using Pandas in Python, which takes less
than a second.
However, if I try to do the join within the SPARQL process using the following query with subqueries, it takes around 10 seconds:
```
select (count(*) as ?C)
where {
{
select ?X ?Y (struuid() as ?UUID)
where {
values ?X_i { 0 1 2 3 4 5 6 7 8 9 }
values ?X_j { 0 1 2 3 4 5 6 7 8 9 }
bind ( ?X_i + 10 * ?X_j as ?X)
values ?Y_i { 0 1 2 3 4 5 6 7 8 9 }
values ?Y_j { 0 1 2 3 4 5 6 7 8 9 }
bind ( ?Y_i + 10 * ?Y_j as ?Y)
}
}
{
select ?X ?Y
where {
{
select ?X ?Y (rand() as ?RAND)
where {
values ?X_i { 0 1 2 3 4 5 6 7 8 9 }
values ?X_j { 0 1 2 3 4 5 6 7 8 9 }
bind ( ?X_i + 10 * ?X_j as ?X)
values ?Y_i { 0 1 2 3 4 5 6 7 8 9 }
values ?Y_j { 0 1 2 3 4 5 6 7 8 9 }
bind ( ?Y_i + 10 * ?Y_j as ?Y)
}
}
filter (?RAND < 0.95)
}
}
}
```
In practice the “maps” I need to process contain hundreds of
thousands of coordinates and then the final join using Jena becomes
intractable in terms of memory and time.
Whilst if I run each subquery separately and use Pandas for the
join, it completes without breaking a sweat.
I thought SPARQL engines were optimized for joins, so how come Jena
gets absolutely spanked by Pandas here?
Am I missing something when I write the queries?
Regards,
John Walker
Principal Consultant & co-founder
Semaku B.V. | Torenallee 20 (SFJ 3D) | 5617 BC Eindhoven | T +31 6
42590072 | https://semaku.com/
KvK: 58031405 | BTW: NL852842156B01 | IBAN: NL94 INGB 0008 3219
95
--
Lorenz Bühmann
Research Associate/Scientific Developer
Email [email protected]
Institute for Applied Informatics e.V. (InfAI) | Goerdelerring 9 | 04109
Leipzig | Germany