I did some more experimentation and checked the query algebra using the
--explain option.
For sake of simplicity I use a simpler query:
```
select (count(*) as ?C)
where {
{
select ?X ?Y (struuid() as ?UUID)
where {
values ?X { 0 1 }
values ?Y { 0 1 }
}
}
{
select ?X ?Y
where {
{
select ?X ?Y (rand() as ?RAND)
where {
values ?X { 0 1 }
values ?Y { 0 1 }
}
}
filter (?RAND < 0.95)
}
}
}
```
For this the algebra is:
```
(project (?C)
(extend ((?C ?.0))
(group () ((?.0 (count)))
(sequence
(project (?X ?Y ?UUID)
(extend ((?UUID (struuid)))
(sequence
(table (vars ?Y)
(row [?Y 0])
(row [?Y 1])
)
(table (vars ?X)
(row [?X 0])
(row [?X 1])
))))
(project (?X ?Y)
(project (?X ?Y ?/RAND)
(filter (< ?/RAND 0.95)
(extend ((?/RAND (rand)))
(sequence
(table (vars ?Y)
(row [?Y 0])
(row [?Y 1])
)
(table (vars ?X)
(row [?X 0])
(row [?X 1])
))))))))))
```
Whilst if I make a small change to also project some other variable from the
second subquery
```
select (count(*) as ?C)
where {
{
select ?X ?Y (struuid() as ?UUID)
where {
values ?X { 0 1 }
values ?Y { 0 1 }
}
}
{
select ?X ?Y (0 as ?_)
where {
{
select ?X ?Y (rand() as ?RAND)
where {
values ?X { 0 1 }
values ?Y { 0 1 }
}
}
filter (?RAND < 0.95)
}
}
}
```
Then the algebra is:
```
(project (?C)
(extend ((?C ?.0))
(group () ((?.0 (count)))
(join
(project (?X ?Y ?UUID)
(extend ((?UUID (struuid)))
(sequence
(table (vars ?Y)
(row [?Y 0])
(row [?Y 1])
)
(table (vars ?X)
(row [?X 0])
(row [?X 1])
))))
(project (?X ?Y ?_)
(extend ((?_ 0))
(project (?X ?Y ?/RAND)
(filter (< ?/RAND 0.95)
(extend ((?/RAND (rand)))
(sequence
(table (vars ?Y)
(row [?Y 0])
(row [?Y 1])
)
(table (vars ?X)
(row [?X 0])
(row [?X 1])
)))))))))))
```
Note the outermost sequence operator has changed to a join operator.
I don’t understand the logic behind that.
Note that projecting the ?RAND variable from the second query does not force
the join.
John
> -----Original Message-----
> From: John Walker <[email protected]>
> Sent: Friday, 29 March 2024 08:55
> To: [email protected]
> Subject: RE: Performance question with joins
>
> I did a bit more experimentation by putting the second subquery inside some
> other clauses:
>
> * FILTER EXISTS - no effect
> * OPTIONAL - runtime around 0.5s
> * MINUS - runtime around 0.5s
>
> So, I assume that the engine is doing some form of nested loop join to iterate
> through each solution from the first subquery and test against the second.
> Same as what is happening with FILTER EXISTS.
>
> A "hack" to get around this seems to be to add a redundant MINUS {}
> between the subqueries.
>
> John
>
> > -----Original Message-----
> > From: John Walker <[email protected]>
> > Sent: Friday, 29 March 2024 07:58
> > To: jena-users-ml <[email protected]>
> > Subject: Performance question with joins
> >
> > Hi,
> >
> > I am working with some data representing a 2D Cartesian coordinate
> > system representing simple grid array “maps”
> > The X and Y coordinates are represented as integers.
> >
> > I want to join data from different “layers” in the data.
> > One layer contains a unique identifier for each position.
> > The other layer only contains a subset of coordinates.
> >
> > I have written the following queries to simulate some data to
> > demonstrate the issue I face.
> >
> > This query generates a solution field with the XY coordinates and an
> > identifier for each:
> >
> > ```
> > select ?X ?Y (struuid() as ?UUID)
> > where {
> > values ?X_i { 0 1 2 3 4 5 6 7 8 9 }
> > values ?X_j { 0 1 2 3 4 5 6 7 8 9 }
> > bind ( ?X_i + 10 * ?X_j as ?X)
> > values ?Y_i { 0 1 2 3 4 5 6 7 8 9 }
> > values ?Y_j { 0 1 2 3 4 5 6 7 8 9 }
> > bind ( ?Y_i + 10 * ?Y_j as ?Y)
> > }
> > ```
> >
> > This query generates a filtered subset of the XY coordinates:
> >
> > ```
> > select ?X ?Y
> > where {
> > {
> > select ?X ?Y (rand() as ?RAND)
> > where {
> > values ?X_i { 0 1 2 3 4 5 6 7 8 9 }
> > values ?X_j { 0 1 2 3 4 5 6 7 8 9 }
> > bind ( ?X_i + 10 * ?X_j as ?X)
> > values ?Y_i { 0 1 2 3 4 5 6 7 8 9 }
> > values ?Y_j { 0 1 2 3 4 5 6 7 8 9 }
> > bind ( ?Y_i + 10 * ?Y_j as ?Y)
> > }
> > }
> > filter (?RAND < 0.95)
> > }
> > ```
> >
> > Now if I run each of those and write the results out as CSV using the
> > sparql command line tool, each completes in under half a second.
> >
> > I can then join the results using Pandas in Python, which takes less
> > than a second.
> >
> > However, if I try to do the join within the SPARQL process usings the
> > following query with subqueries, it takes around 10 seconds:
> >
> > ```
> > select (count(*) as ?C)
> > where {
> > {
> > select ?X ?Y (struuid() as ?UUID)
> > where {
> > values ?X_i { 0 1 2 3 4 5 6 7 8 9 }
> > values ?X_j { 0 1 2 3 4 5 6 7 8 9 }
> > bind ( ?X_i + 10 * ?X_j as ?X)
> > values ?Y_i { 0 1 2 3 4 5 6 7 8 9 }
> > values ?Y_j { 0 1 2 3 4 5 6 7 8 9 }
> > bind ( ?Y_i + 10 * ?Y_j as ?Y)
> > }
> > }
> > {
> > select ?X ?Y
> > where {
> > {
> > select ?X ?Y (rand() as ?RAND)
> > where {
> > values ?X_i { 0 1 2 3 4 5 6 7 8 9 }
> > values ?X_j { 0 1 2 3 4 5 6 7 8 9 }
> > bind ( ?X_i + 10 * ?X_j as ?X)
> > values ?Y_i { 0 1 2 3 4 5 6 7 8 9 }
> > values ?Y_j { 0 1 2 3 4 5 6 7 8 9 }
> > bind ( ?Y_i + 10 * ?Y_j as ?Y)
> > }
> > }
> > filter (?RAND < 0.95)
> > }
> > }
> > }
> > ```
> >
> > In practice the “maps” I need to process contain hundreds of thousands
> > of coordinates and then the final join using Jena becomes intractable
> > in terms of memory and time.
> > Whilst if I run each subquery separately and use Pandas for the join,
> > it completes without breaking a sweat.
> >
> > I thought SPARQL engines were optimized for joins, so how come Jena
> > gets absolutely spanked by Pandas here?
> >
> > Am I missing something when I write the queries?
> >
> > Regards,
> >
> > John Walker
> > Principal Consultant & co-founder
> >
> > Semaku B.V. | Torenallee 20 (SFJ 3D) | 5617 BC Eindhoven | T +31 6
> > 42590072 | https://semaku.com/
> > KvK: 58031405 | BTW: NL852842156B01 | IBAN: NL94 INGB 0008 3219 95