Re: Performance question with joins

David Habgood Mon, 01 Apr 2024 15:59:22 -0700

Hi John,

I'm keen to have a go rewriting the query if you're able to make some data
available / confirm the expected result.


Thanks

On Tue, Apr 2, 2024 at 5:18 AM John Walker <[email protected]>
wrote:

> Hi Andy,
>
> Thanks for the pointers.
>
> I have been digging a bit more into the plans for the actual queries
> against the actual datasets and see that a join is used.
>
> Here's an example of the query:
>
> ```
> prefix mfg: <http://example.com/def/mfg/>
> prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
> prefix xsd: <http://www.w3.org/2001/XMLSchema#>
>
> select distinct ?uid
> where {
>   { select ?wafer ?uid (?X - ?ref_X as ?X_) (?Y - ?ref_Y as ?Y_) {
>     graph ?g_uid {
>       [] a mfg:ReferenceDevice ;
>         rdfs:label "ref die"^^xsd:string ;
>         mfg:location [ mfg:xPos ?ref_X ; mfg:yPos ?ref_Y ] .
>
>       [] mfg:object [ rdfs:label ?uid ] ;
>         mfg:location [ mfg:xPos ?X ; mfg:yPos ?Y ; mfg:containedIn ?wafer ]
>     }
>   }}
>   { select ?wafer (?X - ?ref_X as ?X_) (?Y - ?ref_Y as ?Y_) {
>     graph ?g_fwt {
>       [] a mfg:ReferenceDevice ;
>         rdfs:label "Prober Reference"^^xsd:string ;
>         mfg:location [ mfg:xPos ?ref_X ; mfg:yPos ?ref_Y ] .
>
>       [] mfg:binCode [ mfg:pick true ] ;
>         mfg:location [ mfg:xPos ?X ; mfg:yPos ?Y ; mfg:containedIn ?wafer ]
>     }
>   }}
> }
> ```
>
> The gist of this is that we try to normalize the coordinates from two maps
> based on a common reference point.
>
> And the algebra generated from running that query against some example
> data:
>
> ```
>   (distinct
>     (project (?uid)
>       (join
>         (project (?wafer ?uid ?X_ ?Y_)
>           (extend ((?X_ (- ?/X ?/ref_X)) (?Y_ (- ?/Y ?/ref_Y)))
>             (graph ?/g_uid
>               (bgp
>                 (triple ?/?0 <
> http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <
> http://example.com/def/mfg/ReferenceDevice>)
>                 (triple ?/?0 <http://www.w3.org/2000/01/rdf-schema#label>
> "ref die")
>                 (triple ?/?0 <http://example.com/def/mfg/location> ?/?1)
>                 (triple ?/?1 <http://example.com/def/mfg/xPos> ?/ref_X)
>                 (triple ?/?1 <http://example.com/def/mfg/yPos> ?/ref_Y)
>                 (triple ?/?2 <http://example.com/def/mfg/object> ?/?3)
>                 (triple ?/?3 <http://www.w3.org/2000/01/rdf-schema#label>
> ?uid)
>                 (triple ?/?2 <http://example.com/def/mfg/location> ?/?4)
>                 (triple ?/?4 <http://example.com/def/mfg/xPos> ?/X)
>                 (triple ?/?4 <http://example.com/def/mfg/yPos> ?/Y)
>                 (triple ?/?4 <http://example.com/def/mfg/containedIn>
> ?wafer)
>               ))))
>         (project (?wafer ?X_ ?Y_)
>           (extend ((?X_ (- ?/X ?/ref_X)) (?Y_ (- ?/Y ?/ref_Y)))
>             (graph ?/g_fwt
>               (bgp
>                 (triple ?/?5 <
> http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <
> http://example.com/def/mfg/ReferenceDevice>)
>                 (triple ?/?5 <http://www.w3.org/2000/01/rdf-schema#label>
> "Prober Reference")
>                 (triple ?/?5 <http://example.com/def/mfg/location> ?/?6)
>                 (triple ?/?6 <http://example.com/def/mfg/xPos> ?/ref_X)
>                 (triple ?/?6 <http://example.com/def/mfg/yPos> ?/ref_Y)
>                 (triple ?/?7 <http://example.com/def/mfg/binCode> ?/?8)
>                 (triple ?/?8 <http://example.com/def/mfg/pick> true)
>                 (triple ?/?7 <http://example.com/def/mfg/location> ?/?9)
>                 (triple ?/?9 <http://example.com/def/mfg/xPos> ?/X)
>                 (triple ?/?9 <http://example.com/def/mfg/yPos> ?/Y)
>                 (triple ?/?9 <http://example.com/def/mfg/containedIn>
> ?wafer)
>               )))))))
> ```
>
> The query requires over 30 minutes to complete.
>
> The cardinality on the left and right side of the join is around 125k.
> So large, but not huge.
>
> If I run the subqueries in isolation, they complete in a reasonable time
> of ~30 seconds.
> Serializing the results to CSV and doing the join with Pandas takes a few
> seconds.
>
> ```
> from pandas import read_csv
>
> df_1 = read_csv('subquery_1.csv', encoding =
> 'utf-8').set_index(['wafer','X_', 'Y_'])
> df_2 = read_csv('subquery_2.csv', encoding =
> 'utf-8').set_index(['wafer','X_', 'Y_'])
>
> df_joined = df_1.join(df_2)
>
> print(df_joined)
> ```
>
> Admittedly I am telling Pandas which fields to index, but I assume that
> should be self-evident from the SPARQL query text.
>
> Does it help to make some example RDF data available to help see if Jena
> can be improved for such use cases, or am I better off just using Panda for
> that last join?
>
> Regards,
> John
>
> > -----Original Message-----
> > From: Andy Seaborne <[email protected]>
> > Sent: Monday, 1 April 2024 19:04
> > To: [email protected]
> > Subject: Re: Performance question with joins
> >
> > Hi John,
> >
> > Yes, the join of two large subqueries is the issue.
> >
> > Optimization involves making pragmatic determination. Sometimes it isn't
> > optimal for some data.
> >
> > Something to consider is detecting these independency of the (?X_i,
> > ?X_j) and (?Y_i, ?Y_j) blocks because ia hash join is likely a better
> choice. That,
> > or caching part evaluations there is cross-product like effects.
> >
> > Thank you for the details.
> >
> >      Andy
> >
> > See also:
> >
> > "A Worst-Case Optimal Join Algorithm for SPARQL"
> > https://aidanhogan.com/docs/SPARQL_worst_case_optimal.pdf
> >
> > "Leapfrog Triejoin: A Simple, Worst-Case Optimal Join"
> > Algorithm
> > https://www.openproceedings.org/2014/conf/icdt/Veldhuizen14.pdf
> >
> > On 29/03/2024 09:25, John Walker wrote:
> > > I did some more experimentation and checked the query algebra using
> the --
> > explain option.
> > >
> > > For sake of simplicity I use a simpler query:
> > >
> > > ```
> > > select (count(*) as ?C)
> > > where {
> > >    {
> > >      select ?X ?Y (struuid() as ?UUID)
> > >      where {
> > >        values ?X { 0 1 }
> > >        values ?Y { 0 1 }
> > >      }
> > >    }
> > >    {
> > >      select ?X ?Y
> > >      where {
> > >        {
> > >          select ?X ?Y (rand() as ?RAND)
> > >          where {
> > >            values ?X { 0 1 }
> > >            values ?Y { 0 1 }
> > >          }
> > >        }
> > >        filter (?RAND < 0.95)
> > >      }
> > >    }
> > > }
> > > ```
> > >
> > > For this the algebra is:
> > >
> > > ```
> > >    (project (?C)
> > >      (extend ((?C ?.0))
> > >        (group () ((?.0 (count)))
> > >          (sequence
> > >            (project (?X ?Y ?UUID)
> > >              (extend ((?UUID (struuid)))
> > >                (sequence
> > >                  (table (vars ?Y)
> > >                    (row [?Y 0])
> > >                    (row [?Y 1])
> > >                  )
> > >                  (table (vars ?X)
> > >                    (row [?X 0])
> > >                    (row [?X 1])
> > >                  ))))
> > >            (project (?X ?Y)
> > >              (project (?X ?Y ?/RAND)
> > >                (filter (< ?/RAND 0.95)
> > >                  (extend ((?/RAND (rand)))
> > >                    (sequence
> > >                      (table (vars ?Y)
> > >                        (row [?Y 0])
> > >                        (row [?Y 1])
> > >                      )
> > >                      (table (vars ?X)
> > >                        (row [?X 0])
> > >                        (row [?X 1])
> > >                      ))))))))))
> > > ```
> > >
> > > Whilst if I make a small change to also project some other variable
> > > from the second subquery
> > >
> > > ```
> > > select (count(*) as ?C)
> > > where {
> > >    {
> > >      select ?X ?Y (struuid() as ?UUID)
> > >      where {
> > >        values ?X { 0 1 }
> > >        values ?Y { 0 1 }
> > >      }
> > >    }
> > >    {
> > >      select ?X ?Y (0 as ?_)
> > >      where {
> > >        {
> > >          select ?X ?Y (rand() as ?RAND)
> > >          where {
> > >            values ?X { 0 1 }
> > >            values ?Y { 0 1 }
> > >          }
> > >        }
> > >        filter (?RAND < 0.95)
> > >      }
> > >    }
> > > }
> > > ```
> > >
> > > Then the algebra is:
> > >
> > > ```
> > >    (project (?C)
> > >      (extend ((?C ?.0))
> > >        (group () ((?.0 (count)))
> > >          (join
> > >            (project (?X ?Y ?UUID)
> > >              (extend ((?UUID (struuid)))
> > >                (sequence
> > >                  (table (vars ?Y)
> > >                    (row [?Y 0])
> > >                    (row [?Y 1])
> > >                  )
> > >                  (table (vars ?X)
> > >                    (row [?X 0])
> > >                    (row [?X 1])
> > >                  ))))
> > >            (project (?X ?Y ?_)
> > >              (extend ((?_ 0))
> > >                (project (?X ?Y ?/RAND)
> > >                  (filter (< ?/RAND 0.95)
> > >                    (extend ((?/RAND (rand)))
> > >                      (sequence
> > >                        (table (vars ?Y)
> > >                          (row [?Y 0])
> > >                          (row [?Y 1])
> > >                        )
> > >                        (table (vars ?X)
> > >                          (row [?X 0])
> > >                          (row [?X 1])
> > >                        )))))))))))
> > > ```
> > >
> > > Note the outermost sequence operator has changed to a join operator.
> > > I don’t understand the logic behind that.
> > >
> > > Note that projecting the ?RAND variable from the second query does not
> > force the join.
> > >
> > > John
> > >
> > >> -----Original Message-----
> > >> From: John Walker <[email protected]>
> > >> Sent: Friday, 29 March 2024 08:55
> > >> To: [email protected]
> > >> Subject: RE: Performance question with joins
> > >>
> > >> I did a bit more experimentation by putting the second subquery
> > >> inside some other clauses:
> > >>
> > >> * FILTER EXISTS - no effect
> > >> * OPTIONAL - runtime around 0.5s
> > >> * MINUS - runtime around 0.5s
> > >>
> > >> So, I assume that the engine is doing some form of nested loop join
> > >> to iterate through each solution from the first subquery and test
> against
> > the second.
> > >> Same as what is happening with FILTER EXISTS.
> > >>
> > >> A "hack" to get around this seems to be to add a redundant MINUS {}
> > >> between the subqueries.
> > >>
> > >> John
> > >>
> > >>> -----Original Message-----
> > >>> From: John Walker <[email protected]>
> > >>> Sent: Friday, 29 March 2024 07:58
> > >>> To: jena-users-ml <[email protected]>
> > >>> Subject: Performance question with joins
> > >>>
> > >>> Hi,
> > >>>
> > >>> I am working with some data representing a 2D Cartesian coordinate
> > >>> system representing simple grid array “maps”
> > >>> The X and Y coordinates are represented as integers.
> > >>>
> > >>> I want to join data from different “layers” in the data.
> > >>> One layer contains a unique identifier for each position.
> > >>> The other layer only contains a subset of coordinates.
> > >>>
> > >>> I have written the following queries to simulate some data to
> > >>> demonstrate the issue I face.
> > >>>
> > >>> This query generates a solution field with the XY coordinates and an
> > >>> identifier for each:
> > >>>
> > >>> ```
> > >>> select ?X ?Y (struuid() as ?UUID)
> > >>> where {
> > >>>    values ?X_i { 0 1 2 3 4 5 6 7 8 9 }
> > >>>    values ?X_j { 0 1 2 3 4 5 6 7 8 9 }
> > >>>    bind ( ?X_i + 10 * ?X_j as ?X)
> > >>>    values ?Y_i { 0 1 2 3 4 5 6 7 8 9 }
> > >>>    values ?Y_j { 0 1 2 3 4 5 6 7 8 9 }
> > >>>    bind ( ?Y_i + 10 * ?Y_j as ?Y)
> > >>> }
> > >>> ```
> > >>>
> > >>> This query generates a filtered subset of the XY coordinates:
> > >>>
> > >>> ```
> > >>> select ?X ?Y
> > >>> where {
> > >>>    {
> > >>>      select ?X ?Y (rand() as ?RAND)
> > >>>      where {
> > >>>        values ?X_i { 0 1 2 3 4 5 6 7 8 9 }
> > >>>        values ?X_j { 0 1 2 3 4 5 6 7 8 9 }
> > >>>        bind ( ?X_i + 10 * ?X_j as ?X)
> > >>>        values ?Y_i { 0 1 2 3 4 5 6 7 8 9 }
> > >>>        values ?Y_j { 0 1 2 3 4 5 6 7 8 9 }
> > >>>        bind ( ?Y_i + 10 * ?Y_j as ?Y)
> > >>>      }
> > >>>    }
> > >>>    filter (?RAND < 0.95)
> > >>> }
> > >>> ```
> > >>>
> > >>> Now if I run each of those and write the results out as CSV using
> > >>> the sparql command line tool, each completes in under half a second.
> > >>>
> > >>> I can then join the results using Pandas in Python, which takes less
> > >>> than a second.
> > >>>
> > >>> However, if I try to do the join within the SPARQL process usings
> > >>> the following query with subqueries, it takes around 10 seconds:
> > >>>
> > >>> ```
> > >>> select (count(*) as ?C)
> > >>> where {
> > >>>    {
> > >>>      select ?X ?Y (struuid() as ?UUID)
> > >>>      where {
> > >>>        values ?X_i { 0 1 2 3 4 5 6 7 8 9 }
> > >>>        values ?X_j { 0 1 2 3 4 5 6 7 8 9 }
> > >>>        bind ( ?X_i + 10 * ?X_j as ?X)
> > >>>        values ?Y_i { 0 1 2 3 4 5 6 7 8 9 }
> > >>>        values ?Y_j { 0 1 2 3 4 5 6 7 8 9 }
> > >>>        bind ( ?Y_i + 10 * ?Y_j as ?Y)
> > >>>      }
> > >>>    }
> > >>>    {
> > >>>      select ?X ?Y
> > >>>      where {
> > >>>        {
> > >>>          select ?X ?Y (rand() as ?RAND)
> > >>>          where {
> > >>>            values ?X_i { 0 1 2 3 4 5 6 7 8 9 }
> > >>>            values ?X_j { 0 1 2 3 4 5 6 7 8 9 }
> > >>>            bind ( ?X_i + 10 * ?X_j as ?X)
> > >>>            values ?Y_i { 0 1 2 3 4 5 6 7 8 9 }
> > >>>            values ?Y_j { 0 1 2 3 4 5 6 7 8 9 }
> > >>>            bind ( ?Y_i + 10 * ?Y_j as ?Y)
> > >>>          }
> > >>>        }
> > >>>        filter (?RAND < 0.95)
> > >>>      }
> > >>>    }
> > >>> }
> > >>> ```
> > >>>
> > >>> In practice the “maps” I need to process contain hundreds of
> > >>> thousands of coordinates and then the final join using Jena becomes
> > >>> intractable in terms of memory and time.
> > >>> Whilst if I run each subquery separately and use Pandas for the
> > >>> join, it completes without breaking a sweat.
> > >>>
> > >>> I thought SPARQL engines were optimized for joins, so how come Jena
> > >>> gets absolutely spanked by Pandas here?
> > >>>
> > >>> Am I missing something when I write the queries?
> > >>>
> > >>> Regards,
> > >>>
> > >>> John Walker
> > >>> Principal Consultant & co-founder
> > >>>
> > >>> Semaku B.V. | Torenallee 20 (SFJ 3D) | 5617 BC Eindhoven | T +31 6
> > >>> 42590072 | https://semaku.com/
> > >>> KvK: 58031405 | BTW: NL852842156B01 | IBAN: NL94 INGB 0008 3219
> > 95
>

Re: Performance question with joins

Reply via email to