RE: Performance question with joins

John Walker Fri, 29 Mar 2024 02:25:39 -0700

I did some more experimentation and checked the query algebra using the 
--explain option.


For sake of simplicity I use a simpler query:

```
select (count(*) as ?C)
where {
  {
    select ?X ?Y (struuid() as ?UUID)
    where {
      values ?X { 0 1 }
      values ?Y { 0 1 }
    }
  }
  {
    select ?X ?Y
    where {
      {
        select ?X ?Y (rand() as ?RAND)
        where {
          values ?X { 0 1 }
          values ?Y { 0 1 }
        }
      }
      filter (?RAND < 0.95)
    }
  }
}
```

For this the algebra is:

```
  (project (?C)
    (extend ((?C ?.0))
      (group () ((?.0 (count)))
        (sequence
          (project (?X ?Y ?UUID)
            (extend ((?UUID (struuid)))
              (sequence
                (table (vars ?Y)
                  (row [?Y 0])
                  (row [?Y 1])
                )
                (table (vars ?X)
                  (row [?X 0])
                  (row [?X 1])
                ))))
          (project (?X ?Y)
            (project (?X ?Y ?/RAND)
              (filter (< ?/RAND 0.95)
                (extend ((?/RAND (rand)))
                  (sequence
                    (table (vars ?Y)
                      (row [?Y 0])
                      (row [?Y 1])
                    )
                    (table (vars ?X)
                      (row [?X 0])
                      (row [?X 1])
                    ))))))))))
```

Whilst if I make a small change to also project some other variable from the 
second subquery

```
select (count(*) as ?C)
where {
  {
    select ?X ?Y (struuid() as ?UUID)
    where {
      values ?X { 0 1 }
      values ?Y { 0 1 }
    }
  }
  {
    select ?X ?Y (0 as ?_)
    where {
      {
        select ?X ?Y (rand() as ?RAND)
        where {
          values ?X { 0 1 }
          values ?Y { 0 1 }
        }
      }
      filter (?RAND < 0.95)
    }
  }
}
```

Then the algebra is:

```
  (project (?C)
    (extend ((?C ?.0))
      (group () ((?.0 (count)))
        (join
          (project (?X ?Y ?UUID)
            (extend ((?UUID (struuid)))
              (sequence
                (table (vars ?Y)
                  (row [?Y 0])
                  (row [?Y 1])
                )
                (table (vars ?X)
                  (row [?X 0])
                  (row [?X 1])
                ))))
          (project (?X ?Y ?_)
            (extend ((?_ 0))
              (project (?X ?Y ?/RAND)
                (filter (< ?/RAND 0.95)
                  (extend ((?/RAND (rand)))
                    (sequence
                      (table (vars ?Y)
                        (row [?Y 0])
                        (row [?Y 1])
                      )
                      (table (vars ?X)
                        (row [?X 0])
                        (row [?X 1])
                      )))))))))))
```

Note the outermost sequence operator has changed to a join operator.
I don’t understand the logic behind that.

Note that projecting the ?RAND variable from the second query does not force 
the join.

John

> -----Original Message-----
> From: John Walker <[email protected]>
> Sent: Friday, 29 March 2024 08:55
> To: [email protected]
> Subject: RE: Performance question with joins
> 
> I did a bit more experimentation by putting the second subquery inside some
> other clauses:
> 
> * FILTER EXISTS - no effect
> * OPTIONAL - runtime around 0.5s
> * MINUS - runtime around 0.5s
> 
> So, I assume that the engine is doing some form of nested loop join to iterate
> through each solution from the first subquery and test against the second.
> Same as what is happening with FILTER EXISTS.
> 
> A "hack" to get around this seems to be to add a redundant MINUS {}
> between the subqueries.
> 
> John
> 
> > -----Original Message-----
> > From: John Walker <[email protected]>
> > Sent: Friday, 29 March 2024 07:58
> > To: jena-users-ml <[email protected]>
> > Subject: Performance question with joins
> >
> > Hi,
> >
> > I am working with some data representing a 2D Cartesian coordinate
> > system representing simple grid array “maps”
> > The X and Y coordinates are represented as integers.
> >
> > I want to join data from different “layers” in the data.
> > One layer contains a unique identifier for each position.
> > The other layer only contains a subset of coordinates.
> >
> > I have written the following queries to simulate some data to
> > demonstrate the issue I face.
> >
> > This query generates a solution field with the XY coordinates and an
> > identifier for each:
> >
> > ```
> > select ?X ?Y (struuid() as ?UUID)
> > where {
> >   values ?X_i { 0 1 2 3 4 5 6 7 8 9 }
> >   values ?X_j { 0 1 2 3 4 5 6 7 8 9 }
> >   bind ( ?X_i + 10 * ?X_j as ?X)
> >   values ?Y_i { 0 1 2 3 4 5 6 7 8 9 }
> >   values ?Y_j { 0 1 2 3 4 5 6 7 8 9 }
> >   bind ( ?Y_i + 10 * ?Y_j as ?Y)
> > }
> > ```
> >
> > This query generates a filtered subset of the XY coordinates:
> >
> > ```
> > select ?X ?Y
> > where {
> >   {
> >     select ?X ?Y (rand() as ?RAND)
> >     where {
> >       values ?X_i { 0 1 2 3 4 5 6 7 8 9 }
> >       values ?X_j { 0 1 2 3 4 5 6 7 8 9 }
> >       bind ( ?X_i + 10 * ?X_j as ?X)
> >       values ?Y_i { 0 1 2 3 4 5 6 7 8 9 }
> >       values ?Y_j { 0 1 2 3 4 5 6 7 8 9 }
> >       bind ( ?Y_i + 10 * ?Y_j as ?Y)
> >     }
> >   }
> >   filter (?RAND < 0.95)
> > }
> > ```
> >
> > Now if I run each of those and write the results out as CSV using the
> > sparql command line tool, each completes in under half a second.
> >
> > I can then join the results using Pandas in Python, which takes less
> > than a second.
> >
> > However, if I try to do the join within the SPARQL process usings the
> > following query with subqueries, it takes around 10 seconds:
> >
> > ```
> > select (count(*) as ?C)
> > where {
> >   {
> >     select ?X ?Y (struuid() as ?UUID)
> >     where {
> >       values ?X_i { 0 1 2 3 4 5 6 7 8 9 }
> >       values ?X_j { 0 1 2 3 4 5 6 7 8 9 }
> >       bind ( ?X_i + 10 * ?X_j as ?X)
> >       values ?Y_i { 0 1 2 3 4 5 6 7 8 9 }
> >       values ?Y_j { 0 1 2 3 4 5 6 7 8 9 }
> >       bind ( ?Y_i + 10 * ?Y_j as ?Y)
> >     }
> >   }
> >   {
> >     select ?X ?Y
> >     where {
> >       {
> >         select ?X ?Y (rand() as ?RAND)
> >         where {
> >           values ?X_i { 0 1 2 3 4 5 6 7 8 9 }
> >           values ?X_j { 0 1 2 3 4 5 6 7 8 9 }
> >           bind ( ?X_i + 10 * ?X_j as ?X)
> >           values ?Y_i { 0 1 2 3 4 5 6 7 8 9 }
> >           values ?Y_j { 0 1 2 3 4 5 6 7 8 9 }
> >           bind ( ?Y_i + 10 * ?Y_j as ?Y)
> >         }
> >       }
> >       filter (?RAND < 0.95)
> >     }
> >   }
> > }
> > ```
> >
> > In practice the “maps” I need to process contain hundreds of thousands
> > of coordinates and then the final join using Jena becomes intractable
> > in terms of memory and time.
> > Whilst if I run each subquery separately and use Pandas for the join,
> > it completes without breaking a sweat.
> >
> > I thought SPARQL engines were optimized for joins, so how come Jena
> > gets absolutely spanked by Pandas here?
> >
> > Am I missing something when I write the queries?
> >
> > Regards,
> >
> > John Walker
> > Principal Consultant & co-founder
> >
> > Semaku B.V. | Torenallee 20 (SFJ 3D) | 5617 BC Eindhoven | T +31 6
> > 42590072 | https://semaku.com/
> > KvK: 58031405 | BTW: NL852842156B01 | IBAN: NL94 INGB 0008 3219 95

RE: Performance question with joins

Reply via email to