Sorry to have dropped the ball on this past couple of days. I created the JIRA with information shared on this mail thread.
https://issues.apache.org/jira/browse/SOLR-16838 Running further benchmarks reveals that the slowness is in the searcher.getFirstMatch() call inside getInputDocument() . *Solr 7.7.2* 2023-06-01 21:17:34.492 WARN (qtp1034094674-41) [ x:techproducts] o.a.s.u.p.LogUpdateProcessorFactory RTG timing stats:: tlogFetchTime: 508053 ; *searcherFetchTime: 3229011 * *Solr 8* 2023-06-01 20:43:31.767 WARN (qtp391506011-56) [ x:techproducts] o.a.s.u.p.LogUpdateProcessorFactory RTG timing stats:: tlogFetchTime: 410873 ; *searcherFetchTime: 33296008 * Starting a discussion on the Lucene list as well. Thanks, Rahul On Thu, Jun 1, 2023 at 9:17 PM Ishan Chattopadhyaya < ichattopadhy...@gmail.com> wrote: > > Solr unfortunately do not yet have a comprehensive official nightly > benchmark run that would catch such regressions. But the community is > working on it, there are some tests on http://mostly.cool maintained by > Ishan, but I have no idea how to add new tests.. > > I'll publish steps to add new tests soon. > > For now, check out solr-bench from GitHub.com/fullstorydev/solr-bench > > Run an existing test as follows: > > mvn clean compile assembly:single > > ./stress.sh -c <commitid> suites/cluster-test.json > > Results will be written out in suites/results/. > > > > On Fri, 2 Jun, 2023, 2:44 am Jan Høydahl, <jan....@cominvent.com> wrote: > > > Hi, > > > > Next step I believe is for you to open a Solr JIRA issue for this > > regression, as it is obviously an issue. > > Then we can use that JIRA to coordinate, even if the fix ends up being in > > Lucene code. > > > > Once we have a Solr JIRA, it may make sense to open a mail thread on > > users@lucene list to get some Lucene expertice involved in digging. > > > > Jan > > > > > 1. jun. 2023 kl. 18:15 skrev Rahul Goswami <rahul196...@gmail.com>: > > > > > > So I ran the test to index 5 million docs this time (batches of 1000 > docs > > > in 15 parallel threads). To eliminate the network overhead and get as > > > accurate a benchmark as possible, I used an AtomicLong to measure the > > time > > > around the RTG call in DistibutedUpdateProcessor across all calls ( > > > > > > https://github.com/apache/lucene-solr/blob/releases/lucene-solr/7.7.2/solr/core/src/java/org/apache/solr/update/processor/DistributedUpdateProcessor.java#L1416 > > ). > > > Did this for both Solr 7.7.2 and Solr 8.11.1 and built the > solr-core.jar > > to > > > replace it in the solr webapp lib. > > > > > > RTG in Solr 8.x is ~10x slower. Here are the numbers (times are in > > > milliseconds): > > > > > > *Solr 7.7.2 *: 2023-06-01 15:39:48.272 WARN (qtp1034094674-24) [ > > > x:techproducts] o.a.s.u.p.LogUpdateProcessorFactory *Total rtg > > time:7293486* > > > *Solr 8.11.1: *2023-06-01 04:46:24.758 WARN (qtp391506011-71) [ > > > x:techproducts] o.a.s.u.p.LogUpdateProcessorFactory *Total rtg > > > time:72029877* > > > > > > Let me know if I can help further with debugging. At this point I am > > > suspecting the slowness to be on Lucene's side with the findTargetArc() > > > method in FST.java. I am running more analysis on my side in parallel > and > > > will share if/when I find more. > > > > > > -Rahul > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Jun 1, 2023 at 3:53 AM Jan Høydahl <jan....@cominvent.com> > > wrote: > > > > > >> Yes, the numbers should be similar. Thanks for digging in. If there > is a > > >> big regression, it could be worthwile running the same test on v8.0, > > 8.1, > > >> 8.2 etc to plot in what version the regression happened. Perhaps it > > could > > >> even be scripted :) > > >> > > >> Solr unfortunately do not yet have a comprehensive official nightly > > >> benchmark run that would catch such regressions. But the community is > > >> working on it, there are some tests on http://mostly.cool maintained > by > > >> Ishan, but I have no idea how to add new tests.. > > >> > > >> Jan > > >> > > >>> 31. mai 2023 kl. 23:50 skrev Rahul Goswami <rahul196...@gmail.com>: > > >>> > > >>> Sure, I can do that. Let me create an index with a few million docs, > > call > > >>> RTG with a few million iterations on it and note the times between > 7.x > > >> and > > >>> 8.x. I assume this should be sufficient (?) > > >>> > > >>> On Wed, May 31, 2023 at 5:19 PM Jan Høydahl <jan....@cominvent.com> > > >> wrote: > > >>> > > >>>> Would be nice to determine whether RTG is orders of magnitude slower > > in > > >>>> 8.x than 7.x and is the main culprit. Then we could isolate the > > >> testing to > > >>>> RTG only and not involce Atomic Update? > > >>>> > > >>>> Jan > > >>>> > > >>>>> 31. mai 2023 kl. 21:33 skrev Rahul Goswami <rahul196...@gmail.com > >: > > >>>>> > > >>>>> I don’t have any nested documents. And the results are consistent > > >> across > > >>>>> multiple runs. I tried looking for similar issues in the mailing > > list, > > >>>> but > > >>>>> couldn’t find anything relevant . So if you do happen to find any > > JIRAs > > >>>>> addressing it that would be really helpful (thanks!). > > >>>>> > > >>>>> To Jan’s question about RTG taking more time in Solr 8.x, I can say > > >> with > > >>>>> good certainty that this is the case. Although it does look into > > >>>>> transaction logs first, thread dumps suggest that it is the next > > phase > > >>>>> (when it doesn't find the doc in tlog) which seems to be time > > >> consuming . > > >>>>> It tries to look up the document via the current searcher > > >>>>> (searcher.getFirstMatch() ). Proceeding further in the stack, it is > > >> this > > >>>>> call where many threads are spending time: > > >>>>> > > >>>>> > > >>>> > > >> > > > https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.1/lucene/core/src/java/org/apache/lucene/codecs/blocktree/SegmentTermsEnum.java#L485 > > >>>>> > > >>>>> Although this call is the same in 7.7.2 and 8.11.1 quite likely > > >>>>> something changed in Lucene's FST.java which is causing the > > slowness. I > > >>>> am > > >>>>> trying to dig further and might also ask folks on the Lucene > mailing > > >>>> list. > > >>>>> Thanks. > > >>>>> > > >>>>> > > >>>>> > > >>>>> On Wed, May 31, 2023 at 11:36 AM Srijan <shree...@gmail.com> > wrote: > > >>>>> > > >>>>>> I would love some profiling as well. I know 8.8 or 8.9 had some > > >>>> performance > > >>>>>> problems with atomic update but this was later addressed. I cant > > find > > >>>> the > > >>>>>> jira atm though. Also I am on 8.11.1 and atomic update is not an > > issue > > >>>> for > > >>>>>> me. > > >>>>>> > > >>>>>> By the way, do you happen to have nested docs? > > >>>>>> > > >>>>>> > > >>>>>> On Wed, May 31, 2023, 11:20 Jan Høydahl <jan....@cominvent.com> > > >> wrote: > > >>>>>> > > >>>>>>> Hi > > >>>>>>> > > >>>>>>> MMap is most important for searching. Indexing bypasses the cache > > by > > >>>>>> using > > >>>>>>> direct IO. > > >>>>>>> > > >>>>>>> I have noticed slow real time get on Solr 8.x during atomic > update > > >>>>>> myself. > > >>>>>>> Would be interesting with a comparison with profiling. RTG gets > the > > >>>>>>> document from transaction log I believe? Could there be some RTG > > >>>> changes > > >>>>>> in > > >>>>>>> 8.x that caused such slowdown? > > >>>>>>> > > >>>>>>> Jan Høydahl > > >>>>>>> > > >>>>>>>> 31. mai 2023 kl. 16:57 skrev Rahul Goswami < > rahul196...@gmail.com > > >: > > >>>>>>>> > > >>>>>>>> Thanks for the response Shawn. We are using Windows server with > > >>>> pretty > > >>>>>>> huge > > >>>>>>>> indexes (multiple TB cores). With Mmap, I have observed that the > > >>>>>> machine > > >>>>>>>> just completely freezes with high CPU and memory usage to a > point > > >>>> where > > >>>>>>> it > > >>>>>>>> becomes impossible to even connect to it. SimpleFS works out > well > > >> for > > >>>>>> us > > >>>>>>> in > > >>>>>>>> this case. > > >>>>>>>> > > >>>>>>>> As noted in my first email, even with SimpleFS, Solr 7 completes > > the > > >>>>>>> crawl > > >>>>>>>> in nearly 1/5th the time taken in Solr 8. Hence there should be > > >>>>>> something > > >>>>>>>> OUTSIDE the directory factory in the code which is causing this. > > >>>>>>>> > > >>>>>>>> Thanks, > > >>>>>>>> Rahul > > >>>>>>>> > > >>>>>>>> > > >>>>>>>>> On Tue, May 30, 2023 at 10:47 PM Shawn Heisey < > > apa...@elyograg.org > > >>> > > >>>>>>> wrote: > > >>>>>>>>> > > >>>>>>>>>> On 5/30/23 15:34, Rahul Goswami wrote: > > >>>>>>>>>> Environment details: - Java 11 on Windows server - Xms1536m > > >> Xmx3072m > > >>>>>> - > > >>>>>>>>>> Indexing client code running 15 parallel threads indexing in > > >> batches > > >>>>>> of > > >>>>>>>>>> 1000 - using SimpleFSDirectoryFactory (since Mmap doesn't > quite > > >> work > > >>>>>>>>>> well on Windows for our index sizes which commonly run north > of > > 1 > > >>>> TB) > > >>>>>>>>> > > >>>>>>>>> Don't change the directoryFactory. You *WANT* Solr to use MMAP > > for > > >>>>>> your > > >>>>>>>>> indexes. Not using MMAP is likely to slow things down > > >> considerably. > > >>>>>>>>> MMAP should work just fine on 64-bit Windows with 64-bit Java. > > >> Which > > >>>>>> of > > >>>>>>>>> course requires 64-bit hardware. > > >>>>>>>>> > > >>>>>>>>> 32 bit systems and software cannot properly deal with data > larger > > >>>> than > > >>>>>>>>> about 2GB. > > >>>>>>>>> > > >>>>>>>>> Thanks, > > >>>>>>>>> Shawn > > >>>>>>>>> > > >>>>>>> > > >>>>>> > > >>>> > > >>>> > > >> > > >> > > > > >