Re: Aligning Shards from different Collections on the same Solr server based on Date Range

Matt Kuiper Wed, 14 Jul 2021 14:55:14 -0700

Thanks Joel!  I will give this a try.  That is quite a performance boost.

Matt


On Tue, Jul 13, 2021 at 9:14 AM Joel Bernstein <[email protected]> wrote:

> The optimized join was added in Solr 8.8:
> https://issues.apache.org/jira/browse/SOLR-15049
>
> It kicks in when you use the join qparser plugin in the following scenario:
>
> 1) Do not specify a fromIndex. This is because the to and from index are
> the same.
> 2) The to and from fields are the same.
> 3) The join method is topLevelDV.
>
> {!join to=store_id from=store_id method=topLevelDV}
>
> If you do this with Solr 8.8+ you get the effect of SOLR-15049. It is a
> massive performance improvement. In my testing it was 7000 times faster
> than the standard join parser plugin for larger joins.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
>
> On Mon, Jul 12, 2021 at 1:34 PM Matt Kuiper <[email protected]> wrote:
>
> > Hi Joel,
> >
> > I reviewed a few options with my team, and your recommendation is at the
> > top of the list.  I believe it will work for our use case.
> >
> > You mentioned that if this approach worked, you would be willing to share
> > more details on an "optimized self join."
> >
> > I would enjoy hearing more.
> >
> > Thanks,
> > Matt
> >
> > On Fri, Jul 9, 2021 at 9:36 AM Joel Bernstein <[email protected]> wrote:
> >
> > > Block join is another option. If that works for you, from an indexing
> > > standpoint, it's the most performant query time join.
> > >
> > > If block indexing doesn't work for you then the optimized self join is
> > > almost as fast.
> > >
> > >
> > > Joel Bernstein
> > > http://joelsolr.blogspot.com/
> > >
> > >
> > > On Fri, Jul 9, 2021 at 11:31 AM Matt Kuiper <[email protected]> wrote:
> > >
> > > > Thanks Joel!
> > > >
> > > > On my list is to investigate Block Joins and Nested Child docs.
> > > >
> > > >
> > > >
> > >
> >
> https://solr.apache.org/guide/8_8/other-parsers.html#block-join-query-parsers
> > > >
> > > >
> > > >
> > >
> >
> https://solr.apache.org/guide/8_8/indexing-nested-documents.html#indexing-nested-documents
> > > >
> > > > > However, it looks like you are not suggesting using nested docs, but
> > > > > specifying a type field to differentiate between types of docs and then a
> > > > > join field.  Not having to build nested docs prior to updates would be an
> > > > > advantage.  And it makes sense that the join field would allow for reliable
> > > > > routing to the appropriate shard for both doc types.
> > > > >
> > > > > I will take a further look and see if this approach will work, and get back
> > > > > if more info is needed on the optimized self join.
> > > >
> > > > Thanks again,
> > > > Matt
> > > >
> > > >
> > > > > On Fri, Jul 9, 2021 at 7:01 AM Joel Bernstein <[email protected]> wrote:
> > > >
> > > > > > Can you solve this problem by adding all documents into the same collection
> > > > > > and performing self joins? You could add a field called rec_type to
> > > > > > differentiate between the records.
> > > > > >
> > > > > > There are two good reasons for wanting to do this.
> > > > > >
> > > > > > 1) This allows you to route by the join key and easily co-locate records.
> > > > > >
> > > > > > 2) There is an optimized self join which is extremely fast that you could
> > > > > > take advantage of if you did this.
> > > > > >
> > > > > > Let me know if this might be an option for you and we can discuss the
> > > > > > optimized self join in more detail.
> > > > >
> > > > > Joel
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > Joel Bernstein
> > > > > http://joelsolr.blogspot.com/
> > > > >
> > > > >
> > > > > > On Fri, Jul 2, 2021 at 6:28 PM Matt Kuiper <[email protected]> wrote:
> > > > >
> > > > > > > After some research, it appears the following approach may help in this
> > > > > > > situation and relieve the requirement of collocating indexes for Joins.  It
> > > > > > > appears one drawback may be the types of fields supported for the JOIN
> > > > > > > field.
> > > > > > >
> > > > > > > https://solr.apache.org/guide/8_8/other-parsers.html#cross-collection-join
> > > > > >
> > > > > > Matt
> > > > > >
> > > > > > > On Wed, Jun 30, 2021 at 11:59 AM Matt Kuiper <[email protected]> wrote:
> > > > > >
> > > > > > > > Hi Solr Group,
> > > > > > > >
> > > > > > > > I am not sure the following is a viable use case, so I welcome input and any
> > > > > > > > implementation recommendations.
> > > > > > > >
> > > > > > > > I would like to perform joins over two sharded collections, where docs
> > > > > > > > are routed to specific shards based on a date range, and the date ranges
> > > > > > > > are the same for the corresponding shards in each collection.
> > > > > > > >
> > > > > > > > I understand that this means that the replicas from each collection that
> > > > > > > > hold data to be joined need to be co-located on the same Solr server.  I
> > > > > > > > have read solutions that use ADD REPLICA to add a Collection B replica to
> > > > > > > > all Solr servers, assuming Collection B has only one shard.  For my use case
> > > > > > > > I need Collection B to have multiple shards.
> > > > > > >
> > > > > > > > Collection A    Collection B    SolrServer
> > > > > > > > Shard1_2020     Shard1_2020     172.33.0.1:8983_solr
> > > > > > > > Shard2_2021     Shard2_2021     172.33.0.2:8983_solr
> > > > > > > > Shard3_2022     Shard3_2022     172.33.0.3:8983_solr
> > > > > > >
> > > > > > > > I think my question comes down to how do I break shards by a date range,
> > > > > > > > and do it in a way that both Collections A and B would be defined by the
> > > > > > > > same date range?  If I could reliably break shards by date, and know the
> > > > > > > > date range of the shard, I think I could use the ADD REPLICA API to align.
> > > > > > > >
> > > > > > > > I am not sure a compositeId routing approach would work, but I am thinking
> > > > > > > > an implicit id may be hard to manage over time.
> > > > > > > >
> > > > > > > > Is an approach like this viable?  I am a bit concerned about maintenance.
> > > > > > > > Are there other ideas to support this join?
> > > > > > >
> > > > > > > Note: I am considering this within Time series collections...
> > > > > > >
> > > > > > > Matt
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
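
For readers following the thread, the self join Joel describes (no fromIndex, identical to and from fields, method=topLevelDV) might be put together roughly as below. This is only a sketch under assumptions: store_id and rec_type come from the thread, but the rec_type values "store" and "order", the region field, and both helper functions are hypothetical names made up for illustration. The composite_id helper assumes the collection uses Solr's compositeId router, where an id of the form key!doc_id routes by the key, which is one way to route by the join key. The topLevelDV method also relies on docValues being enabled on the join field.

```python
from urllib.parse import urlencode


def self_join_params(join_field, from_filter, to_filter):
    """Build query parameters for a same-collection join on a single field.

    No fromIndex is given, the from and to fields are identical, and the
    method is topLevelDV: the three conditions listed in the thread.
    """
    join = "{{!join from={f} to={f} method=topLevelDV}}{q}".format(
        f=join_field, q=from_filter
    )
    # In a self join the "from" documents share the join key with themselves,
    # so a filter on rec_type keeps only the other record type in the results.
    return {"q": join, "fq": to_filter}


def composite_id(join_key, doc_id):
    """Prefix the document id with the join key (compositeId router assumed)."""
    return "{}!{}".format(join_key, doc_id)


# Hypothetical example: orders whose store matches some store-level criteria.
params = self_join_params(
    "store_id",
    "rec_type:store AND region:west",  # "region" is an illustrative field
    "rec_type:order",
)
query_string = urlencode(params)  # ready to append to a /select request URL
```

Because every record that shares a store_id also shares an id prefix, both record types for that store should land on the same shard, which is what lets the join run locally on each shard.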
