RE: Performance question with joins

John Walker Mon, 01 Apr 2024 12:18:33 -0700

Hi Andy,

Thanks for the pointers.


I have been digging a bit more into the plans for the actual queries against 
the actual datasets and see that a join is used.

Here's an example of the query:

```
prefix mfg: <http://example.com/def/mfg/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>

select distinct ?uid
where {
  { select ?wafer ?uid (?X - ?ref_X as ?X_) (?Y - ?ref_Y as ?Y_) {
    graph ?g_uid {
      [] a mfg:ReferenceDevice ;
        rdfs:label "ref die"^^xsd:string ;
        mfg:location [ mfg:xPos ?ref_X ; mfg:yPos ?ref_Y ] .

      [] mfg:object [ rdfs:label ?uid ] ;
        mfg:location [ mfg:xPos ?X ; mfg:yPos ?Y ; mfg:containedIn ?wafer ]
    }
  }}
  { select ?wafer (?X - ?ref_X as ?X_) (?Y - ?ref_Y as ?Y_) {
    graph ?g_fwt {
      [] a mfg:ReferenceDevice ;
        rdfs:label "Prober Reference"^^xsd:string ;
        mfg:location [ mfg:xPos ?ref_X ; mfg:yPos ?ref_Y ] .

      [] mfg:binCode [ mfg:pick true ] ;
        mfg:location [ mfg:xPos ?X ; mfg:yPos ?Y ; mfg:containedIn ?wafer ]
    }
  }}
}
```

The gist of this is that we try to normalize the coordinates from two maps 
based on a common reference point.

And the algebra generated from running that query against some example data:

```
  (distinct
    (project (?uid)
      (join
        (project (?wafer ?uid ?X_ ?Y_)
          (extend ((?X_ (- ?/X ?/ref_X)) (?Y_ (- ?/Y ?/ref_Y)))
            (graph ?/g_uid
              (bgp
                (triple ?/?0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> 
<http://example.com/def/mfg/ReferenceDevice>)
                (triple ?/?0 <http://www.w3.org/2000/01/rdf-schema#label> "ref 
die")
                (triple ?/?0 <http://example.com/def/mfg/location> ?/?1)
                (triple ?/?1 <http://example.com/def/mfg/xPos> ?/ref_X)
                (triple ?/?1 <http://example.com/def/mfg/yPos> ?/ref_Y)
                (triple ?/?2 <http://example.com/def/mfg/object> ?/?3)
                (triple ?/?3 <http://www.w3.org/2000/01/rdf-schema#label> ?uid)
                (triple ?/?2 <http://example.com/def/mfg/location> ?/?4)
                (triple ?/?4 <http://example.com/def/mfg/xPos> ?/X)
                (triple ?/?4 <http://example.com/def/mfg/yPos> ?/Y)
                (triple ?/?4 <http://example.com/def/mfg/containedIn> ?wafer)
              ))))
        (project (?wafer ?X_ ?Y_)
          (extend ((?X_ (- ?/X ?/ref_X)) (?Y_ (- ?/Y ?/ref_Y)))
            (graph ?/g_fwt
              (bgp
                (triple ?/?5 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> 
<http://example.com/def/mfg/ReferenceDevice>)
                (triple ?/?5 <http://www.w3.org/2000/01/rdf-schema#label> 
"Prober Reference")
                (triple ?/?5 <http://example.com/def/mfg/location> ?/?6)
                (triple ?/?6 <http://example.com/def/mfg/xPos> ?/ref_X)
                (triple ?/?6 <http://example.com/def/mfg/yPos> ?/ref_Y)
                (triple ?/?7 <http://example.com/def/mfg/binCode> ?/?8)
                (triple ?/?8 <http://example.com/def/mfg/pick> true)
                (triple ?/?7 <http://example.com/def/mfg/location> ?/?9)
                (triple ?/?9 <http://example.com/def/mfg/xPos> ?/X)
                (triple ?/?9 <http://example.com/def/mfg/yPos> ?/Y)
                (triple ?/?9 <http://example.com/def/mfg/containedIn> ?wafer)
              )))))))
```

The query requires over 30 minutes to complete.

The cardinality on the left and right side of the join is around 125k.
So large, but not huge.

If I run the subqueries in isolation, they complete in a reasonable time of ~30 
seconds.
Serializing the results to CSV and doing the join with Pandas takes a few 
seconds.

```
from pandas import read_csv

df_1 = read_csv('subquery_1.csv', encoding = 'utf-8').set_index(['wafer','X_', 
'Y_'])
df_2 = read_csv('subquery_2.csv', encoding = 'utf-8').set_index(['wafer','X_', 
'Y_'])

df_joined = df_1.join(df_2)

print(df_joined)
```

Admittedly I am telling Pandas which fields to index, but I assume that should 
be self-evident from the SPARQL query text.

Does it help to make some example RDF data available to help see if Jena can be 
improved for such use cases, or am I better off just using Panda for that last 
join?

Regards,
John

> -----Original Message-----
> From: Andy Seaborne <[email protected]>
> Sent: Monday, 1 April 2024 19:04
> To: [email protected]
> Subject: Re: Performance question with joins
> 
> Hi John,
> 
> Yes, the join of two large subqueries is the issue.
> 
> Optimization involves making pragmatic determination. Sometimes it isn't
> optimal for some data.
> 
> Something to consider is detecting these independency of the (?X_i,
> ?X_j) and (?Y_i, ?Y_j) blocks because ia hash join is likely a better choice. 
> That,
> or caching part evaluations there is cross-product like effects.
> 
> Thank you for the details.
> 
>      Andy
> 
> See also:
> 
> "A Worst-Case Optimal Join Algorithm for SPARQL"
> https://aidanhogan.com/docs/SPARQL_worst_case_optimal.pdf
> 
> "Leapfrog Triejoin: A Simple, Worst-Case Optimal Join"
> Algorithm
> https://www.openproceedings.org/2014/conf/icdt/Veldhuizen14.pdf
> 
> On 29/03/2024 09:25, John Walker wrote:
> > I did some more experimentation and checked the query algebra using the --
> explain option.
> >
> > For sake of simplicity I use a simpler query:
> >
> > ```
> > select (count(*) as ?C)
> > where {
> >    {
> >      select ?X ?Y (struuid() as ?UUID)
> >      where {
> >        values ?X { 0 1 }
> >        values ?Y { 0 1 }
> >      }
> >    }
> >    {
> >      select ?X ?Y
> >      where {
> >        {
> >          select ?X ?Y (rand() as ?RAND)
> >          where {
> >            values ?X { 0 1 }
> >            values ?Y { 0 1 }
> >          }
> >        }
> >        filter (?RAND < 0.95)
> >      }
> >    }
> > }
> > ```
> >
> > For this the algebra is:
> >
> > ```
> >    (project (?C)
> >      (extend ((?C ?.0))
> >        (group () ((?.0 (count)))
> >          (sequence
> >            (project (?X ?Y ?UUID)
> >              (extend ((?UUID (struuid)))
> >                (sequence
> >                  (table (vars ?Y)
> >                    (row [?Y 0])
> >                    (row [?Y 1])
> >                  )
> >                  (table (vars ?X)
> >                    (row [?X 0])
> >                    (row [?X 1])
> >                  ))))
> >            (project (?X ?Y)
> >              (project (?X ?Y ?/RAND)
> >                (filter (< ?/RAND 0.95)
> >                  (extend ((?/RAND (rand)))
> >                    (sequence
> >                      (table (vars ?Y)
> >                        (row [?Y 0])
> >                        (row [?Y 1])
> >                      )
> >                      (table (vars ?X)
> >                        (row [?X 0])
> >                        (row [?X 1])
> >                      ))))))))))
> > ```
> >
> > Whilst if I make a small change to also project some other variable
> > from the second subquery
> >
> > ```
> > select (count(*) as ?C)
> > where {
> >    {
> >      select ?X ?Y (struuid() as ?UUID)
> >      where {
> >        values ?X { 0 1 }
> >        values ?Y { 0 1 }
> >      }
> >    }
> >    {
> >      select ?X ?Y (0 as ?_)
> >      where {
> >        {
> >          select ?X ?Y (rand() as ?RAND)
> >          where {
> >            values ?X { 0 1 }
> >            values ?Y { 0 1 }
> >          }
> >        }
> >        filter (?RAND < 0.95)
> >      }
> >    }
> > }
> > ```
> >
> > Then the algebra is:
> >
> > ```
> >    (project (?C)
> >      (extend ((?C ?.0))
> >        (group () ((?.0 (count)))
> >          (join
> >            (project (?X ?Y ?UUID)
> >              (extend ((?UUID (struuid)))
> >                (sequence
> >                  (table (vars ?Y)
> >                    (row [?Y 0])
> >                    (row [?Y 1])
> >                  )
> >                  (table (vars ?X)
> >                    (row [?X 0])
> >                    (row [?X 1])
> >                  ))))
> >            (project (?X ?Y ?_)
> >              (extend ((?_ 0))
> >                (project (?X ?Y ?/RAND)
> >                  (filter (< ?/RAND 0.95)
> >                    (extend ((?/RAND (rand)))
> >                      (sequence
> >                        (table (vars ?Y)
> >                          (row [?Y 0])
> >                          (row [?Y 1])
> >                        )
> >                        (table (vars ?X)
> >                          (row [?X 0])
> >                          (row [?X 1])
> >                        )))))))))))
> > ```
> >
> > Note the outermost sequence operator has changed to a join operator.
> > I don’t understand the logic behind that.
> >
> > Note that projecting the ?RAND variable from the second query does not
> force the join.
> >
> > John
> >
> >> -----Original Message-----
> >> From: John Walker <[email protected]>
> >> Sent: Friday, 29 March 2024 08:55
> >> To: [email protected]
> >> Subject: RE: Performance question with joins
> >>
> >> I did a bit more experimentation by putting the second subquery
> >> inside some other clauses:
> >>
> >> * FILTER EXISTS - no effect
> >> * OPTIONAL - runtime around 0.5s
> >> * MINUS - runtime around 0.5s
> >>
> >> So, I assume that the engine is doing some form of nested loop join
> >> to iterate through each solution from the first subquery and test against
> the second.
> >> Same as what is happening with FILTER EXISTS.
> >>
> >> A "hack" to get around this seems to be to add a redundant MINUS {}
> >> between the subqueries.
> >>
> >> John
> >>
> >>> -----Original Message-----
> >>> From: John Walker <[email protected]>
> >>> Sent: Friday, 29 March 2024 07:58
> >>> To: jena-users-ml <[email protected]>
> >>> Subject: Performance question with joins
> >>>
> >>> Hi,
> >>>
> >>> I am working with some data representing a 2D Cartesian coordinate
> >>> system representing simple grid array “maps”
> >>> The X and Y coordinates are represented as integers.
> >>>
> >>> I want to join data from different “layers” in the data.
> >>> One layer contains a unique identifier for each position.
> >>> The other layer only contains a subset of coordinates.
> >>>
> >>> I have written the following queries to simulate some data to
> >>> demonstrate the issue I face.
> >>>
> >>> This query generates a solution field with the XY coordinates and an
> >>> identifier for each:
> >>>
> >>> ```
> >>> select ?X ?Y (struuid() as ?UUID)
> >>> where {
> >>>    values ?X_i { 0 1 2 3 4 5 6 7 8 9 }
> >>>    values ?X_j { 0 1 2 3 4 5 6 7 8 9 }
> >>>    bind ( ?X_i + 10 * ?X_j as ?X)
> >>>    values ?Y_i { 0 1 2 3 4 5 6 7 8 9 }
> >>>    values ?Y_j { 0 1 2 3 4 5 6 7 8 9 }
> >>>    bind ( ?Y_i + 10 * ?Y_j as ?Y)
> >>> }
> >>> ```
> >>>
> >>> This query generates a filtered subset of the XY coordinates:
> >>>
> >>> ```
> >>> select ?X ?Y
> >>> where {
> >>>    {
> >>>      select ?X ?Y (rand() as ?RAND)
> >>>      where {
> >>>        values ?X_i { 0 1 2 3 4 5 6 7 8 9 }
> >>>        values ?X_j { 0 1 2 3 4 5 6 7 8 9 }
> >>>        bind ( ?X_i + 10 * ?X_j as ?X)
> >>>        values ?Y_i { 0 1 2 3 4 5 6 7 8 9 }
> >>>        values ?Y_j { 0 1 2 3 4 5 6 7 8 9 }
> >>>        bind ( ?Y_i + 10 * ?Y_j as ?Y)
> >>>      }
> >>>    }
> >>>    filter (?RAND < 0.95)
> >>> }
> >>> ```
> >>>
> >>> Now if I run each of those and write the results out as CSV using
> >>> the sparql command line tool, each completes in under half a second.
> >>>
> >>> I can then join the results using Pandas in Python, which takes less
> >>> than a second.
> >>>
> >>> However, if I try to do the join within the SPARQL process usings
> >>> the following query with subqueries, it takes around 10 seconds:
> >>>
> >>> ```
> >>> select (count(*) as ?C)
> >>> where {
> >>>    {
> >>>      select ?X ?Y (struuid() as ?UUID)
> >>>      where {
> >>>        values ?X_i { 0 1 2 3 4 5 6 7 8 9 }
> >>>        values ?X_j { 0 1 2 3 4 5 6 7 8 9 }
> >>>        bind ( ?X_i + 10 * ?X_j as ?X)
> >>>        values ?Y_i { 0 1 2 3 4 5 6 7 8 9 }
> >>>        values ?Y_j { 0 1 2 3 4 5 6 7 8 9 }
> >>>        bind ( ?Y_i + 10 * ?Y_j as ?Y)
> >>>      }
> >>>    }
> >>>    {
> >>>      select ?X ?Y
> >>>      where {
> >>>        {
> >>>          select ?X ?Y (rand() as ?RAND)
> >>>          where {
> >>>            values ?X_i { 0 1 2 3 4 5 6 7 8 9 }
> >>>            values ?X_j { 0 1 2 3 4 5 6 7 8 9 }
> >>>            bind ( ?X_i + 10 * ?X_j as ?X)
> >>>            values ?Y_i { 0 1 2 3 4 5 6 7 8 9 }
> >>>            values ?Y_j { 0 1 2 3 4 5 6 7 8 9 }
> >>>            bind ( ?Y_i + 10 * ?Y_j as ?Y)
> >>>          }
> >>>        }
> >>>        filter (?RAND < 0.95)
> >>>      }
> >>>    }
> >>> }
> >>> ```
> >>>
> >>> In practice the “maps” I need to process contain hundreds of
> >>> thousands of coordinates and then the final join using Jena becomes
> >>> intractable in terms of memory and time.
> >>> Whilst if I run each subquery separately and use Pandas for the
> >>> join, it completes without breaking a sweat.
> >>>
> >>> I thought SPARQL engines were optimized for joins, so how come Jena
> >>> gets absolutely spanked by Pandas here?
> >>>
> >>> Am I missing something when I write the queries?
> >>>
> >>> Regards,
> >>>
> >>> John Walker
> >>> Principal Consultant & co-founder
> >>>
> >>> Semaku B.V. | Torenallee 20 (SFJ 3D) | 5617 BC Eindhoven | T +31 6
> >>> 42590072 | https://semaku.com/
> >>> KvK: 58031405 | BTW: NL852842156B01 | IBAN: NL94 INGB 0008 3219
> 95

RE: Performance question with joins

Reply via email to