I did a bit more experimentation by putting the second subquery inside some 
other clauses:

* FILTER EXISTS - no effect
* OPTIONAL - runtime around 0.5s
* MINUS - runtime around 0.5s

So, I assume that the engine is doing some form of nested loop join to iterate 
through each solution from the first subquery and test against the second.
Same as what is happening with FILTER EXISTS.

A "hack" to get around this seems to be to add a redundant MINUS {} between the 
subqueries.

John

> -----Original Message-----
> From: John Walker <[email protected]>
> Sent: Friday, 29 March 2024 07:58
> To: jena-users-ml <[email protected]>
> Subject: Performance question with joins
> 
> Hi,
> 
> I am working with some data representing a 2D Cartesian coordinate system
> representing simple grid array “maps”
> The X and Y coordinates are represented as integers.
> 
> I want to join data from different “layers” in the data.
> One layer contains a unique identifier for each position.
> The other layer only contains a subset of coordinates.
> 
> I have written the following queries to simulate some data to demonstrate the
> issue I face.
> 
> This query generates a solution field with the XY coordinates and an 
> identifier
> for each:
> 
> ```
> select ?X ?Y (struuid() as ?UUID)
> where {
>   values ?X_i { 0 1 2 3 4 5 6 7 8 9 }
>   values ?X_j { 0 1 2 3 4 5 6 7 8 9 }
>   bind ( ?X_i + 10 * ?X_j as ?X)
>   values ?Y_i { 0 1 2 3 4 5 6 7 8 9 }
>   values ?Y_j { 0 1 2 3 4 5 6 7 8 9 }
>   bind ( ?Y_i + 10 * ?Y_j as ?Y)
> }
> ```
> 
> This query generates a filtered subset of the XY coordinates:
> 
> ```
> select ?X ?Y
> where {
>   {
>     select ?X ?Y (rand() as ?RAND)
>     where {
>       values ?X_i { 0 1 2 3 4 5 6 7 8 9 }
>       values ?X_j { 0 1 2 3 4 5 6 7 8 9 }
>       bind ( ?X_i + 10 * ?X_j as ?X)
>       values ?Y_i { 0 1 2 3 4 5 6 7 8 9 }
>       values ?Y_j { 0 1 2 3 4 5 6 7 8 9 }
>       bind ( ?Y_i + 10 * ?Y_j as ?Y)
>     }
>   }
>   filter (?RAND < 0.95)
> }
> ```
> 
> Now if I run each of those and write the results out as CSV using the sparql
> command line tool, each completes in under half a second.
> 
> I can then join the results using Pandas in Python, which takes less than a
> second.
> 
> However, if I try to do the join within the SPARQL process usings the 
> following
> query with subqueries, it takes around 10 seconds:
> 
> ```
> select (count(*) as ?C)
> where {
>   {
>     select ?X ?Y (struuid() as ?UUID)
>     where {
>       values ?X_i { 0 1 2 3 4 5 6 7 8 9 }
>       values ?X_j { 0 1 2 3 4 5 6 7 8 9 }
>       bind ( ?X_i + 10 * ?X_j as ?X)
>       values ?Y_i { 0 1 2 3 4 5 6 7 8 9 }
>       values ?Y_j { 0 1 2 3 4 5 6 7 8 9 }
>       bind ( ?Y_i + 10 * ?Y_j as ?Y)
>     }
>   }
>   {
>     select ?X ?Y
>     where {
>       {
>         select ?X ?Y (rand() as ?RAND)
>         where {
>           values ?X_i { 0 1 2 3 4 5 6 7 8 9 }
>           values ?X_j { 0 1 2 3 4 5 6 7 8 9 }
>           bind ( ?X_i + 10 * ?X_j as ?X)
>           values ?Y_i { 0 1 2 3 4 5 6 7 8 9 }
>           values ?Y_j { 0 1 2 3 4 5 6 7 8 9 }
>           bind ( ?Y_i + 10 * ?Y_j as ?Y)
>         }
>       }
>       filter (?RAND < 0.95)
>     }
>   }
> }
> ```
> 
> In practice the “maps” I need to process contain hundreds of thousands of
> coordinates and then the final join using Jena becomes intractable in terms of
> memory and time.
> Whilst if I run each subquery separately and use Pandas for the join, it
> completes without breaking a sweat.
> 
> I thought SPARQL engines were optimized for joins, so how come Jena gets
> absolutely spanked by Pandas here?
> 
> Am I missing something when I write the queries?
> 
> Regards,
> 
> John Walker
> Principal Consultant & co-founder
> 
> Semaku B.V. | Torenallee 20 (SFJ 3D) | 5617 BC Eindhoven | T +31 6
> 42590072 | https://semaku.com/
> KvK: 58031405 | BTW: NL852842156B01 | IBAN: NL94 INGB 0008 3219 95

Reply via email to