Performance question with joins

John Walker Thu, 28 Mar 2024 23:58:38 -0700

Hi,

I am working with some data representing a 2D Cartesian coordinate system 
representing simple grid array “maps”
The X and Y coordinates are represented as integers.


I want to join data from different “layers” in the data.
One layer contains a unique identifier for each position.
The other layer only contains a subset of coordinates.

I have written the following queries to simulate some data to demonstrate the 
issue I face.

This query generates a solution field with the XY coordinates and an identifier 
for each:

```
select ?X ?Y (struuid() as ?UUID)
where {
  values ?X_i { 0 1 2 3 4 5 6 7 8 9 }
  values ?X_j { 0 1 2 3 4 5 6 7 8 9 }
  bind ( ?X_i + 10 * ?X_j as ?X)
  values ?Y_i { 0 1 2 3 4 5 6 7 8 9 }
  values ?Y_j { 0 1 2 3 4 5 6 7 8 9 }
  bind ( ?Y_i + 10 * ?Y_j as ?Y)
}
```

This query generates a filtered subset of the XY coordinates:

```
select ?X ?Y
where {
  {
    select ?X ?Y (rand() as ?RAND)
    where {
      values ?X_i { 0 1 2 3 4 5 6 7 8 9 }
      values ?X_j { 0 1 2 3 4 5 6 7 8 9 }
      bind ( ?X_i + 10 * ?X_j as ?X)
      values ?Y_i { 0 1 2 3 4 5 6 7 8 9 }
      values ?Y_j { 0 1 2 3 4 5 6 7 8 9 }
      bind ( ?Y_i + 10 * ?Y_j as ?Y)
    }
  }
  filter (?RAND < 0.95)
}
```

Now if I run each of those and write the results out as CSV using the sparql 
command line tool, each completes in under half a second.

I can then join the results using Pandas in Python, which takes less than a 
second.

However, if I try to do the join within the SPARQL process usings the following 
query with subqueries, it takes around 10 seconds:

```
select (count(*) as ?C)
where {
  {
    select ?X ?Y (struuid() as ?UUID)
    where {
      values ?X_i { 0 1 2 3 4 5 6 7 8 9 }
      values ?X_j { 0 1 2 3 4 5 6 7 8 9 }
      bind ( ?X_i + 10 * ?X_j as ?X)
      values ?Y_i { 0 1 2 3 4 5 6 7 8 9 }
      values ?Y_j { 0 1 2 3 4 5 6 7 8 9 }
      bind ( ?Y_i + 10 * ?Y_j as ?Y)
    }
  }
  {
    select ?X ?Y
    where {
      {
        select ?X ?Y (rand() as ?RAND)
        where {
          values ?X_i { 0 1 2 3 4 5 6 7 8 9 }
          values ?X_j { 0 1 2 3 4 5 6 7 8 9 }
          bind ( ?X_i + 10 * ?X_j as ?X)
          values ?Y_i { 0 1 2 3 4 5 6 7 8 9 }
          values ?Y_j { 0 1 2 3 4 5 6 7 8 9 }
          bind ( ?Y_i + 10 * ?Y_j as ?Y)
        }
      }
      filter (?RAND < 0.95)
    }
  }
}
```

In practice the “maps” I need to process contain hundreds of thousands of 
coordinates and then the final join using Jena becomes intractable in terms of 
memory and time.
Whilst if I run each subquery separately and use Pandas for the join, it 
completes without breaking a sweat.

I thought SPARQL engines were optimized for joins, so how come Jena gets 
absolutely spanked by Pandas here?

Am I missing something when I write the queries?

Regards,

John Walker
Principal Consultant & co-founder

Semaku B.V. | Torenallee 20 (SFJ 3D) | 5617 BC Eindhoven | T +31 6 42590072 | 
https://semaku.com/
KvK: 58031405 | BTW: NL852842156B01 | IBAN: NL94 INGB 0008 3219 95

Performance question with joins

Reply via email to