Hi John,

Yes, the join of two large subqueries is the issue.

Optimization involves making pragmatic determination. Sometimes it isn't optimal for some data.

Something to consider is detecting these independency of the (?X_i, ?X_j) and (?Y_i, ?Y_j) blocks because ia hash join is likely a better choice. That, or caching part evaluations there is cross-product like effects.

Thank you for the details.

    Andy

See also:

"A Worst-Case Optimal Join Algorithm for SPARQL"
https://aidanhogan.com/docs/SPARQL_worst_case_optimal.pdf

"Leapfrog Triejoin: A Simple, Worst-Case Optimal Join"
Algorithm
https://www.openproceedings.org/2014/conf/icdt/Veldhuizen14.pdf

On 29/03/2024 09:25, John Walker wrote:
I did some more experimentation and checked the query algebra using the 
--explain option.

For sake of simplicity I use a simpler query:

```
select (count(*) as ?C)
where {
   {
     select ?X ?Y (struuid() as ?UUID)
     where {
       values ?X { 0 1 }
       values ?Y { 0 1 }
     }
   }
   {
     select ?X ?Y
     where {
       {
         select ?X ?Y (rand() as ?RAND)
         where {
           values ?X { 0 1 }
           values ?Y { 0 1 }
         }
       }
       filter (?RAND < 0.95)
     }
   }
}
```

For this the algebra is:

```
   (project (?C)
     (extend ((?C ?.0))
       (group () ((?.0 (count)))
         (sequence
           (project (?X ?Y ?UUID)
             (extend ((?UUID (struuid)))
               (sequence
                 (table (vars ?Y)
                   (row [?Y 0])
                   (row [?Y 1])
                 )
                 (table (vars ?X)
                   (row [?X 0])
                   (row [?X 1])
                 ))))
           (project (?X ?Y)
             (project (?X ?Y ?/RAND)
               (filter (< ?/RAND 0.95)
                 (extend ((?/RAND (rand)))
                   (sequence
                     (table (vars ?Y)
                       (row [?Y 0])
                       (row [?Y 1])
                     )
                     (table (vars ?X)
                       (row [?X 0])
                       (row [?X 1])
                     ))))))))))
```

Whilst if I make a small change to also project some other variable from the 
second subquery

```
select (count(*) as ?C)
where {
   {
     select ?X ?Y (struuid() as ?UUID)
     where {
       values ?X { 0 1 }
       values ?Y { 0 1 }
     }
   }
   {
     select ?X ?Y (0 as ?_)
     where {
       {
         select ?X ?Y (rand() as ?RAND)
         where {
           values ?X { 0 1 }
           values ?Y { 0 1 }
         }
       }
       filter (?RAND < 0.95)
     }
   }
}
```

Then the algebra is:

```
   (project (?C)
     (extend ((?C ?.0))
       (group () ((?.0 (count)))
         (join
           (project (?X ?Y ?UUID)
             (extend ((?UUID (struuid)))
               (sequence
                 (table (vars ?Y)
                   (row [?Y 0])
                   (row [?Y 1])
                 )
                 (table (vars ?X)
                   (row [?X 0])
                   (row [?X 1])
                 ))))
           (project (?X ?Y ?_)
             (extend ((?_ 0))
               (project (?X ?Y ?/RAND)
                 (filter (< ?/RAND 0.95)
                   (extend ((?/RAND (rand)))
                     (sequence
                       (table (vars ?Y)
                         (row [?Y 0])
                         (row [?Y 1])
                       )
                       (table (vars ?X)
                         (row [?X 0])
                         (row [?X 1])
                       )))))))))))
```

Note the outermost sequence operator has changed to a join operator.
I don’t understand the logic behind that.

Note that projecting the ?RAND variable from the second query does not force 
the join.

John

-----Original Message-----
From: John Walker <[email protected]>
Sent: Friday, 29 March 2024 08:55
To: [email protected]
Subject: RE: Performance question with joins

I did a bit more experimentation by putting the second subquery inside some
other clauses:

* FILTER EXISTS - no effect
* OPTIONAL - runtime around 0.5s
* MINUS - runtime around 0.5s

So, I assume that the engine is doing some form of nested loop join to iterate
through each solution from the first subquery and test against the second.
Same as what is happening with FILTER EXISTS.

A "hack" to get around this seems to be to add a redundant MINUS {}
between the subqueries.

John

-----Original Message-----
From: John Walker <[email protected]>
Sent: Friday, 29 March 2024 07:58
To: jena-users-ml <[email protected]>
Subject: Performance question with joins

Hi,

I am working with some data representing a 2D Cartesian coordinate
system representing simple grid array “maps”
The X and Y coordinates are represented as integers.

I want to join data from different “layers” in the data.
One layer contains a unique identifier for each position.
The other layer only contains a subset of coordinates.

I have written the following queries to simulate some data to
demonstrate the issue I face.

This query generates a solution field with the XY coordinates and an
identifier for each:

```
select ?X ?Y (struuid() as ?UUID)
where {
   values ?X_i { 0 1 2 3 4 5 6 7 8 9 }
   values ?X_j { 0 1 2 3 4 5 6 7 8 9 }
   bind ( ?X_i + 10 * ?X_j as ?X)
   values ?Y_i { 0 1 2 3 4 5 6 7 8 9 }
   values ?Y_j { 0 1 2 3 4 5 6 7 8 9 }
   bind ( ?Y_i + 10 * ?Y_j as ?Y)
}
```

This query generates a filtered subset of the XY coordinates:

```
select ?X ?Y
where {
   {
     select ?X ?Y (rand() as ?RAND)
     where {
       values ?X_i { 0 1 2 3 4 5 6 7 8 9 }
       values ?X_j { 0 1 2 3 4 5 6 7 8 9 }
       bind ( ?X_i + 10 * ?X_j as ?X)
       values ?Y_i { 0 1 2 3 4 5 6 7 8 9 }
       values ?Y_j { 0 1 2 3 4 5 6 7 8 9 }
       bind ( ?Y_i + 10 * ?Y_j as ?Y)
     }
   }
   filter (?RAND < 0.95)
}
```

Now if I run each of those and write the results out as CSV using the
sparql command line tool, each completes in under half a second.

I can then join the results using Pandas in Python, which takes less
than a second.

However, if I try to do the join within the SPARQL process usings the
following query with subqueries, it takes around 10 seconds:

```
select (count(*) as ?C)
where {
   {
     select ?X ?Y (struuid() as ?UUID)
     where {
       values ?X_i { 0 1 2 3 4 5 6 7 8 9 }
       values ?X_j { 0 1 2 3 4 5 6 7 8 9 }
       bind ( ?X_i + 10 * ?X_j as ?X)
       values ?Y_i { 0 1 2 3 4 5 6 7 8 9 }
       values ?Y_j { 0 1 2 3 4 5 6 7 8 9 }
       bind ( ?Y_i + 10 * ?Y_j as ?Y)
     }
   }
   {
     select ?X ?Y
     where {
       {
         select ?X ?Y (rand() as ?RAND)
         where {
           values ?X_i { 0 1 2 3 4 5 6 7 8 9 }
           values ?X_j { 0 1 2 3 4 5 6 7 8 9 }
           bind ( ?X_i + 10 * ?X_j as ?X)
           values ?Y_i { 0 1 2 3 4 5 6 7 8 9 }
           values ?Y_j { 0 1 2 3 4 5 6 7 8 9 }
           bind ( ?Y_i + 10 * ?Y_j as ?Y)
         }
       }
       filter (?RAND < 0.95)
     }
   }
}
```

In practice the “maps” I need to process contain hundreds of thousands
of coordinates and then the final join using Jena becomes intractable
in terms of memory and time.
Whilst if I run each subquery separately and use Pandas for the join,
it completes without breaking a sweat.

I thought SPARQL engines were optimized for joins, so how come Jena
gets absolutely spanked by Pandas here?

Am I missing something when I write the queries?

Regards,

John Walker
Principal Consultant & co-founder

Semaku B.V. | Torenallee 20 (SFJ 3D) | 5617 BC Eindhoven | T +31 6
42590072 | https://semaku.com/
KvK: 58031405 | BTW: NL852842156B01 | IBAN: NL94 INGB 0008 3219 95

Reply via email to