Re: Parallel SQL join on multivalue fields

Piero Scrima Wed, 01 Jul 2020 08:15:00 -0700

The reason JOIN works is the Calcite framework. The Parallel SQL feature
is built on Calcite, which implements all the SQL features; all you need
to do is give Calcite a way to get at the collection/table. In Solr this
is done by SolrTable.java (package org.apache.solr.handler.sql), an
implementation of AbstractQueryableTable. Once you implement the Calcite
table interface you basically have a Calcite adapter, and Calcite gives
you all the SQL features for free.
You can try it yourself. Let's say you have a collection called table1
with fields name1_s and field1_s (docValues) and a collection called
table2 with fields name2_s and field2_s. Populate table1 with
{name1_s:"obj1_table1",field1_s:"a"},{name1_s:"obj2_table1",field1_s:"b"}
and table2 with
{name2_s:"obj1_table2",field2_s:"a"},{name2_s:"obj2_table2",field2_s:"d"}
and then run:


curl --data-urlencode 'stmt=select a.name1_s, b.name2_s from table1 as a inner join table2 as b on a.field1_s = b.field2_s limit 10' \
  'http://localhost:8999/solr/table1/sql?aggregationMode=facet'

The response will be:

{
  "result-set":{
    "docs":[{
        "name1_s":"obj1_table1",
        "name2_s":"obj1_table2"}
      ,{
        "EOF":true,
        "RESPONSE_TIME":xxx}]}}
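
To see why only one row comes back, here is a minimal Python sketch of the
same inner join over the four sample documents. It only illustrates the join
semantics in memory; it is not how Solr or Calcite executes the query, and
the helper name inner_join is made up for this example.

```python
# In-memory illustration only: mimics the semantics of the SQL inner join above.
table1 = [
    {"name1_s": "obj1_table1", "field1_s": "a"},
    {"name1_s": "obj2_table1", "field1_s": "b"},
]
table2 = [
    {"name2_s": "obj1_table2", "field2_s": "a"},
    {"name2_s": "obj2_table2", "field2_s": "d"},
]


def inner_join(left, right, left_key, right_key):
    """Hash join: index the right side by key, then probe it with each left row."""
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)
    joined = []
    for lrow in left:
        for rrow in index.get(lrow[left_key], []):
            joined.append({**lrow, **rrow})
    return joined


rows = inner_join(table1, table2, "field1_s", "field2_s")
# Project the two selected columns, as the SQL statement does.
result = [{"name1_s": r["name1_s"], "name2_s": r["name2_s"]} for r in rows]
print(result)
```

Only the value "a" appears on both sides ("b" and "d" have no partner), so
the single joined row pairs obj1_table1 with obj1_table2.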

So the join does work.
As I said, it works because Calcite performs it, and I think that process
is not optimized; moreover, it does not work well with multivalued fields.
It would be great if Solr Parallel SQL had an optimized join (using the
streaming join API) and also supported multivalued fields, which could
open up several new use cases.
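
To make the multivalued case concrete, here is a rough Python sketch of what
the streaming expression quoted below does: expand each document into one
tuple per supplier value (the role cartesianProduct plays), sort both sides
by supplier, and merge-join the two sorted streams (the role innerJoin
plays). The sample documents, the field name "component" and the helper
names explode and merge_join are invented for illustration; this is not
Solr code.

```python
from itertools import groupby
from operator import itemgetter


def explode(docs, field):
    """Emit one tuple per value of a (possibly multivalued) field."""
    out = []
    for doc in docs:
        values = doc[field] if isinstance(doc[field], list) else [doc[field]]
        for value in values:
            out.append({**doc, field: value})
    return out


def merge_join(left, right, key):
    """Inner join of two lists that are already sorted ascending by key."""
    lgroups = [(k, list(g)) for k, g in groupby(left, key=itemgetter(key))]
    rgroups = [(k, list(g)) for k, g in groupby(right, key=itemgetter(key))]
    i = j = 0
    joined = []
    while i < len(lgroups) and j < len(rgroups):
        lkey, lrows = lgroups[i]
        rkey, rrows = rgroups[j]
        if lkey < rkey:
            i += 1
        elif lkey > rkey:
            j += 1
        else:
            # Equal keys: pair every left tuple with every right tuple.
            joined.extend({**l, **r} for l in lrows for r in rrows)
            i += 1
            j += 1
    return joined


# Hypothetical sample data with a multivalued "supplier" field.
systems = [
    {"defence_system": "radar", "supplier": ["acme", "globex"]},
    {"defence_system": "sonar", "supplier": ["initech"]},
]
components = [
    {"component": "antenna", "supplier": ["globex", "umbrella"]},
    {"component": "hydrophone", "supplier": ["initech", "acme"]},
]

left = sorted(explode(systems, "supplier"), key=itemgetter("supplier"))
right = sorted(explode(components, "supplier"), key=itemgetter("supplier"))
joined = merge_join(left, right, "supplier")
pairs = [(r["supplier"], r["defence_system"], r["component"]) for r in joined]
print(pairs)
```

A SQL-level equivalent would need the same "one row per value" expansion
before the join condition is evaluated.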

On Wed, 1 Jul 2020 at 15:31, Joel Bernstein <joels...@gmail.com> wrote:

> There isn't any real support for joins in Parallel SQL currently. I'm
> surprised that you're having some success doing them. Can you provide a
> sample SQL join that is working for you?
>
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
>
> On Fri, Jun 26, 2020 at 3:32 AM Piero Scrima <piersc...@gmail.com> wrote:
>
> > Hi,
> >
> > Although there is no trace of join functionality in the official Solr
> > documentation
> > (https://lucene.apache.org/solr/guide/7_4/parallel-sql-interface.html),
> > joining in parallel sql works in practice. It only works if the field is
> > not a multivalued field. For my project it would be fantastic if it also
> > worked with multivalued fields.
> > Is there any way to do it? working with the streaming expression I
> > managed to do it with the following expression:
> >
> > innerJoin(
> >     sort(
> >         cartesianProduct(
> >             search(census_defence_system, q="*:*",
> >                    fl="id,defence_system,description,supplier",
> >                    sort="id asc", qt="/select", rows="1000"),
> >             supplier
> >         ),
> >         by="supplier asc"
> >     ),
> >     sort(
> >         cartesianProduct(
> >             search(census_components, q="*:*",
> >                    fl="id,compoenent_name,supplier",
> >                    sort="id asc", qt="/select", rows="10000"),
> >             supplier
> >         ),
> >         by="supplier asc"
> >     ),
> >     on="supplier"
> > )
> >
> > supplier is of course a multivalued field.
> >
> > Is there a way to do this with parallel sql, and if not can we plan a new
> > feature to add it? I could also work on it.
> >
> > (version 7.4)
> >
> > Thank you
> >
>
