Just for reference, this is how it could be solved with an index-time join.
If we put full copies of the group-member docs as children, then we can
search for parent docs and subtract the joined results, e.g.
q={!v=$mainq} -{!v=$grpmembrz} -{!parent .. filters=$grpmembrz v=$mainq}
There are a few details to work out, and you can see how cumbersome updates
would become, but if the docs are not really huge and the update rate is
moderate, it might be a way to go instead of the query-time join.
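
To sketch what I mean (the doc_type markers, the which= clause, and the
nesting direction are my assumptions here, not a tested recipe): under each
doc that carries a group_member_id, nest a full copy of the doc it points
to, and mark the copies so they can be told apart from the parents:

  { "id": "1", "group_id": "A", "group_member_id": "C",
    "doc_type": "parent",
    "_childDocuments_": [
      { "id": "1_copy_of_C", "doc_type": "grp_copy", "group_id": "C", ... }
    ] }

  q={!v=$mainq} -{!v=$grpmembrz}
    -{!parent which=$parents filters=$grpmembrz v=$mainq}
  mainq=<the user's query>
  grpmembrz=doc_type:grp_copy
  parents=doc_type:parent

The second clause drops the copies themselves from the results, and the
third drops the members whose group matched the user's query.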
On Thu, Jun 15, 2023 at 3:08 PM Ron Haines <[email protected]> wrote:
> yes, we would return 'D'.
>
> So, are you asking why not just do the join in the main index? I started
> that way, then realized that a document and the doc it 'belongs' to would
> both need to be on the same shard for the join to work. That's when I moved
> to the 'fromIndex' approach and created the small 'fromIndex' collection
> (under 200k docs), single-sharded, replicated across all of the shards of
> the main collection.
>
> On Thu, Jun 15, 2023 at 5:57 AM Mikhail Khludnev <[email protected]> wrote:
>
> > Thanks for the clarification, Ron.
> > Why is the membership extracted into a separate index?
> > A join is heavy anyway, but running it cross-core is even heavier.
> >
> > The example you gave is not really specific; I could implement it with
> > fq=-group_member_id:*
> >
> > Let's extend it:
> > doc#  group_id  group_member_id
> > 1     A         C
> > 2     B         -
> > 3     C         -
> > 4     D         *G*
> > 5     E         B
> > 6     F         -
> > 7     G
> >
> > So, if a user runs a query that finds docs A, B, C, D, E, F (but not G),
> > should it return D?
> >
> >
> > On Thu, Jun 15, 2023 at 6:01 AM Ron Haines <[email protected]> wrote:
> >
> > > adding more context as to why we are using the 'join'.
> > >
> > > We have a collection of documents where all documents have a 'group_id'
> > > (which is essentially the doc's id), and some docs also have a
> > > 'group_member_id' that indicates which 'group_id' that doc belongs to. For
> > > example:
> > >
> > > doc#  group_id  group_member_id
> > > 1     A         C
> > > 2     B         -
> > > 3     C         -
> > > 4     D         C
> > > 5     E         B
> > > 6     F         -
> > >
> > > So, if a user runs a query that finds docs A, B, C, D, E, F, we do not
> > > want to include any of the documents that belong to any of the group_ids.
> > > So, for this search we really want a result count of 3 (docs B, C, F).
> > > We want to exclude:
> > > A, because it belongs to C
> > > D, because it belongs to C
> > > E, because it belongs to B
> > >
> > > This negative 'join' &fq is how we are excluding these docs. Note that a
> > > document can 'belong' to more than one document. So, yes, it does affect
> > > the result count, if that was the question.
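> > >
> > > For concreteness, the fq we add is the one quoted lower in this thread:
> > >
> > >   -{!join fromIndex=primary_rollup from=group_id_mv
> > >    to=group_member_id score=none}${q}
> > >
> > > i.e. run the user's query against the small 'fromIndex' collection,
> > > collect the group_id_mv values of whatever matches (which, for the
> > > example above, would be C and B), and exclude any main-collection doc
> > > whose group_member_id is in that set (A, D, E).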
> > >
> > > Thanks for the suggestions. I still have to run the test with
> > > 'method=topLevelDV', and I will pursue getting thread dumps. Thx. More to
> > > come...
> > >
> > > On Wed, Jun 14, 2023 at 4:26 PM Mikhail Khludnev <[email protected]>
> > wrote:
> > >
> > > > Note: images are shredded in the mailing list.
> > > > Well, if we apply a heavy operation (the join), it's reasonable that it
> > > > warms up the CPU. It should also affect the number of results. Does it?
> > > > Overall, the usage seems non-typical: the query looks like role-based
> > > > access control (or a group-membership problem), but has dismax as a
> > > > sub-query. Can't the docs be remodelled in a more efficient manner?
> > > > It's worth understanding what keeps the CPU busy; a few thread dumps
> > > > taken under load usually give a useful clue.
> > > > Also, if the "to" side is huge and highly sharded, the "from" side is
> > > > small, and updates are rare, an index-time join via {!parent} may work
> > > > well. Caveat - it may be cumbersome.
> > > > PS: I suggested two jiras earlier; I don't think they are applicable
> > > > here.
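> > > >
> > > > For the thread dumps, a low-tech option (assuming the JDK tools are
> > > > available on the Solr nodes; finding the pid is up to you) is something
> > > > like:
> > > >
> > > >   for i in 1 2 3 4 5; do jstack <solr-pid> > threads-$i.txt; sleep 5; done
> > > >
> > > > A handful of dumps a few seconds apart under load is usually enough to
> > > > spot where the time goes.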
> > > >
> > > > On Wed, Jun 14, 2023 at 8:26 PM Ron Haines <[email protected]>
> wrote:
> > > >
> > > > > Fyi, I am finally getting back to this. I apologize for the delay.
> > > > >
> > > > >
> > > > >
> > > > > I am going to try using the ‘method=topLevelDV’ option to see if that
> > > > > makes a difference. I will run the same tests used below, and follow up
> > > > > with results.
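> > > > >
> > > > > If I read the ref guide right, that just means adding the method to the
> > > > > join fq, roughly:
> > > > >
> > > > >   -{!join fromIndex=primary_rollup from=group_id_mv
> > > > >    to=group_member_id method=topLevelDV}${q}
> > > > >
> > > > > (I dropped score=none here on the assumption that topLevelDV is a
> > > > > non-scoring method; whether it behaves with a cross-collection fromIndex
> > > > > is part of what I want to verify.)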
> > > > >
> > > > >
> > > > >
> > > > > As far as more details about this scenario:
> > > > >
> > > > > - Per the ‘user query’: some of them are quite simple, edismax, e.g.
> > > > >   q=Maricopa county ethel
> > > > > - From a content point of view, updates are not happening very
> > > > >   frequently; we typically get batches of updates spread out over the
> > > > >   course of the day.
> > > > > - Not quite sure what you are asking for per the 'collection
> > > > >   definitions'. The main collection is about 27 million docs, across 96
> > > > >   shards, 2 replicas. The fromIndex 'join' collection is quite
> > > > >   small...about 80k docs, single shard, but replicated across the 96
> > > > >   shards.
> > > > > - In the table below are the qtimes and response times, run both with
> > > > >   and without the ‘join’. Also included is resultCount, for reference.
> > > > > - It is a small test sample of 12 queries, single-threaded.
> > > > > - Note that the qtimes, on average for this small query set, increase
> > > > >   about 40% with the join.
> > > > >
> > > > >
> > > > > search_qtime  responseTime  search_qtime  responseTime  resultCount
> > > > > (no join)     (no join)     (with join)   (with join)
> > > > > ------------  ------------  ------------  ------------  -----------
> > > > >         1748          3179          2834          4292       471894
> > > > >         1557          2865          1794          3108          332
> > > > >          929          2278          1261          2654       541282
> > > > >          813          2107          1036          2322        15347
> > > > >          413          1730           539          1838           42
> > > > >          388          1725           678          2027          313
> > > > >         1095          2481          1453          2821       435627
> > > > >          829          2263          1310          2739          299
> > > > >          838          2103          1081          2358        86049
> > > > >         1236          2610          1911          3283        77881
> > > > >          950          2274          1313          2661        15160
> > > > >          763          2066           885          2184          738
> > > > >
> > > > > What is most concerning is the cpu increase that we see in Solr. Here
> > > > > is a more 'concurrent' test, at about 12 qps, but it is not at a 'full'
> > > > > load...maybe 50%. This test 'held up', meaning we did not get into any
> > > > > trouble.
> > > > >
> > > > >
> > > > > Hope these images come through...but, here is a cpu profile for a 1
> > > > > hour test with no 'join' being used:
> > > > >
> > > > > [image: CPU profile, 1 hour test, no join]
> > > > >
> > > > > And here is the same 1 hour test, using the 'join', run twice. Note the
> > > > > difference in the 'scale' of cpu between these 2 tests and the one
> > > > > above, from a 'cores' point of view:
> > > > > [image: CPU profile, 1 hour test, with join, run twice]
> > > > >
> > > > > Like I said, I'll run these same tests with ‘method=topLevelDV’ and
> > > > > see if it changes the behavior.
> > > > >
> > > > > Thx
> > > > >
> > > > > Ron Haines
> > > > >
> > > > > On Thu, May 25, 2023 at 4:29 PM Mikhail Khludnev <[email protected]>
> > > > wrote:
> > > > >
> > > > >> Ron, how often are both indices updated? Presumably, if they are
> > > > >> static, the filter cache may help. It's worth making sure that the app
> > > > >> gives the filter cache a chance.
> > > > >> To better understand the problem it is worth taking a few thread dumps
> > > > >> under load: a deep stack gives a clue to the hotspot (or just take a
> > > > >> sampling profile). Once we know the hot spot we can think about a
> > > > >> workaround.
> > > > >> https://issues.apache.org/jira/browse/SOLR-16717 about sharding
> > > > >> "fromIndex"
> > > > >> https://issues.apache.org/jira/browse/SOLR-16242 about keeping the
> > > > >> "local/to" index cache when fromIndex is updated.
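> > > > >>
> > > > >> On the filter cache: since the exclusion fq embeds ${q}, each distinct
> > > > >> user query expands into a distinct cached filter, roughly
> > > > >>
> > > > >>   -{!join fromIndex=primary_rollup from=group_id_mv
> > > > >>    to=group_member_id score=none}<user query text here>
> > > > >>
> > > > >> so the cache only helps when the same user query repeats.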
> > > > >>
> > > > >> On Thu, May 25, 2023 at 5:01 PM Andy Lester <[email protected]>
> > > wrote:
> > > > >>
> > > > >> >
> > > > >> >
> > > > >> > > On May 25, 2023, at 7:51 AM, Ron Haines <[email protected]>
> > > wrote:
> > > > >> > >
> > > > >> > > So, when this feature is enabled, this negative &fq gets added:
> > > > >> > > -{!join fromIndex=primary_rollup from=group_id_mv
> > > > >> > >  to=group_member_id score=none}${q}
> > > > >> >
> > > > >> >
> > > > >> > Can we see collection definitions of both the source collection and
> > > > >> > the join? Also, a sample query, not just the one parameter? Also, how
> > > > >> > often are either of these collections updated? One thing that killed
> > > > >> > off an entire project that we were doing was that the join table was
> > > > >> > getting updated about once a minute, and this destroyed all our
> > > > >> > caching and made the queries we wanted to do unusable.
> > > > >> >
> > > > >> >
> > > > >> > Thanks,
> > > > >> > Andy
> > > > >>
> > > > >>
> > > > >>
> > > > >> --
> > > > >> Sincerely yours
> > > > >> Mikhail Khludnev
> > > > >>
> > > > >
> > > >
> > > > --
> > > > Sincerely yours
> > > > Mikhail Khludnev
> > > >
> > >
> >
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> >
>
--
Sincerely yours
Mikhail Khludnev