Re: Joining more than 2 collections

Joel Bernstein Wed, 03 May 2017 09:00:32 -0700

I've reformatted the expression below and made a few changes. You have put
things together properly. But these are MapReduce joins that require
exporting the entire result sets. So you will need to add qt=/export to all
the searches and remove the rows param. In Solr 6.6 there is a new
"shuffle" expression that does this automatically.


To test things you'll want to break down each expression and make sure it's
behaving as expected.

For example, first run each search on its own. Then run the innerJoin, not in
parallel mode. Then run it in parallel mode. Then try the whole thing.

hashJoin(parallel(collection2,
                  innerJoin(search(collection2,
                                   q=*:*,
                                   fl="a_s,b_s,c_s,d_s,e_s",
                                   sort="a_s asc",
                                   partitionKeys="a_s",
                                   qt="/export"),
                            search(collection1,
                                   q=*:*,
                                   fl="a_s,f_s,g_s,h_s,i_s,j_s",
                                   sort="a_s asc",
                                   partitionKeys="a_s",
                                   qt="/export"),
                            on="a_s"),
                  workers="2",
                  sort="a_s asc"),
         hashed=search(collection3,
                       q=*:*,
                       fl="a_s,k_s,l_s",
                       sort="a_s asc",
                       qt="/export"),
         on="a_s")
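
As a sketch of the second test step (the innerJoin on its own, not in
parallel mode), the inner part of the expression above can be lifted out
unchanged. The collection and field names are the ones from your expression;
nothing else is assumed:

```
innerJoin(search(collection2,
                 q=*:*,
                 fl="a_s,b_s,c_s,d_s,e_s",
                 sort="a_s asc",
                 partitionKeys="a_s",
                 qt="/export"),
          search(collection1,
                 q=*:*,
                 fl="a_s,f_s,g_s,h_s,i_s,j_s",
                 sort="a_s asc",
                 partitionKeys="a_s",
                 qt="/export"),
          on="a_s")
```

If that returns joined tuples, wrap it back in parallel(collection2, ...,
workers="2", sort="a_s asc") and, once that works too, in the outer hashJoin.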

Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, May 3, 2017 at 11:26 AM, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
wrote:

> Hi Joel,
>
> Thanks for the clarification.
>
> Would like to check, is this the correct way to do the join? Currently, I
> could not get any results after putting in the hashJoin for the 3rd,
> smallerStream collection (collection3).
>
> http://localhost:8983/solr/collection1/stream?expr=
> hashJoin(parallel(collection2
> ,
> innerJoin(
>  search(collection2,
> q=*:*,
> fl="a_s,b_s,c_s,d_s,e_s",
>              sort="a_s asc",
> partitionKeys="a_s",
> rows=200),
>  search(collection1,
> q=*:*,
> fl="a_s,f_s,g_s,h_s,i_s,j_s",
>              sort="a_s asc",
> partitionKeys="a_s",
> rows=200),
>          on="a_s"),
> workers="2",
>                  sort="a_s asc"),
>          hashed=search(collection3,
> q=*:*,
> fl="a_s,k_s,l_s",
> sort="a_s asc",
> rows=200),
> on="a_s")
> &indent=true
>
>
> Regards,
> Edwin
>
>
> On 3 May 2017 at 20:59, Joel Bernstein <joels...@gmail.com> wrote:
>
> > Sorry, it's just called hashJoin
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Wed, May 3, 2017 at 2:45 AM, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> > wrote:
> >
> > > Hi Joel,
> > >
> > > I am getting this error when I used the innerHashJoin.
> > >
> > >  "EXCEPTION":"Invalid stream expression innerHashJoin(parallel(innerJoin
> > >
> > > I also can't find the documentation on innerHashJoin for the Streaming
> > > Expressions.
> > >
> > > Are you referring to hashJoin?
> > >
> > > Regards,
> > > Edwin
> > >
> > >
> > > On 3 May 2017 at 13:20, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
> > >
> > > > Hi Joel,
> > > >
> > > > Thanks for the info.
> > > >
> > > > Regards,
> > > > Edwin
> > > >
> > > >
> > > > On 3 May 2017 at 02:04, Joel Bernstein <joels...@gmail.com> wrote:
> > > >
> > > >> Also take a look at the documentation for the "fetch" streaming
> > > >> expression.
> > > >>
> > > >> Joel Bernstein
> > > >> http://joelsolr.blogspot.com/
> > > >>
> > > >> On Tue, May 2, 2017 at 2:03 PM, Joel Bernstein <joels...@gmail.com>
> > > >> wrote:
> > > >>
> > > > >> > Yes, you can join more than one collection with Streaming Expressions.
> > > > >> > Here are a few things to keep in mind.
> > > >> >
> > > > >> > * You'll likely want to use the parallel function around the largest
> > > > >> > join. You'll need to use the join keys as the partitionKeys.
> > > > >> > * innerJoin: requires that the streams be sorted on the join keys.
> > > > >> > * innerHashJoin: has no sorting requirement.
> > > > >> >
> > > > >> > So a strategy for a three collection join might look like this:
> > > > >> >
> > > > >> > innerHashJoin(parallel(innerJoin(bigStream, bigStream)), smallerStream)
> > > > >> >
> > > > >> > The largest join can be done in parallel using an innerJoin. You can
> > > > >> > then wrap the stream coming out of the parallel function in an
> > > > >> > innerHashJoin to join it to another stream.
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> > Joel Bernstein
> > > >> > http://joelsolr.blogspot.com/
> > > >> >
> > > > >> > On Mon, May 1, 2017 at 9:42 PM, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> > > > >> > wrote:
> > > >> >
> > > >> >> Hi,
> > > >> >>
> > > > >> >> Is it possible to join more than 2 collections using one of the
> > > > >> >> streaming expressions (e.g. innerJoin)? If not, are there other ways we
> > > > >> >> can do it?
> > > > >> >>
> > > > >> >> Currently, I may need to join 3 or 4 collections together, and to output
> > > > >> >> selected fields from all these collections together.
> > > >> >>
> > > >> >> I'm using Solr 6.4.2.
> > > >> >>
> > > >> >> Regards,
> > > >> >> Edwin
> > > >> >>
> > > >> >
> > > >> >
> > > >>
> > > >
> > > >
> > >
> >
>
