Erick - These conflict, what's changed?

    So if I were going to recommend settings, they'd be something like this:
    Do a hard commit with openSearcher=false every 60 seconds.
    Do a soft commit every 5 minutes.

vs.

    Index-heavy, Query-light
    Set your soft commit interval quite long, up to the maximum latency you
    can stand for documents to be visible. This could be just a couple of
    minutes or much longer. Maybe even hours, with the capability of issuing
    a hard commit (openSearcher=true) or soft commit on demand.

https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
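For reference, the first of those two excerpts corresponds to settings along
these lines in the <updateHandler> section of solrconfig.xml (a minimal
sketch; the intervals are the ones quoted above, everything else is
illustrative):

    <updateHandler class="solr.DirectUpdateHandler2">
      <!-- Hard commit: write and fsync segments every 60 seconds,
           but don't open a new searcher (docs not yet visible). -->
      <autoCommit>
        <maxTime>60000</maxTime>
        <openSearcher>false</openSearcher>
      </autoCommit>
      <!-- Soft commit: open a new searcher every 5 minutes so recently
           indexed documents become visible to queries. -->
      <autoSoftCommit>
        <maxTime>300000</maxTime>
      </autoSoftCommit>
    </updateHandler>

The "soft commit at -1 (never)" variant mentioned further down would be
<maxTime>-1</maxTime> in the autoSoftCommit block.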
On Sun, Jun 2, 2019 at 8:58 PM Erick Erickson <erickerick...@gmail.com> wrote:
>
>> I've looked through SolrJ, DIH and others -- is the bottom line
>> across all of them to "batch updates" and not commit as long as possible?
>
> Of course it's more complicated than that ;)…
>
> But to start, yes, I urge you to batch. Here are some stats:
> https://lucidworks.com/2015/10/05/really-batch-updates-solr-2/
>
> Note that at about 100 docs/batch you hit diminishing returns. _However_,
> that test was run on a single-shard collection, so if you have 10 shards
> you'd have to send 1,000 docs/batch. I wouldn't sweat that number much,
> just don't send one at a time. And there are the usual gotchas if your
> documents are 1M vs. 1K.
>
> About committing: no, don't hold off as long as possible. When you commit,
> segments are merged. _However_, the default 100M internal buffer size means
> that once you accumulate 100M of index data, segments are written anyway
> even if you haven't hit a commit point, and merges happen anyway. So you
> won't save anything on merging by holding off commits, and you'll incur
> penalties. Here's more than you want to know about commits:
>
> https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>
> But some key take-aways… If for some reason Solr terminates abnormally,
> the accumulated documents since the last hard commit are replayed. So say
> you don't commit for an hour of furious indexing and someone does a
> "kill -9". When you restart Solr, it'll try to re-index all the docs from
> the last hour. Hard commits with openSearcher=false aren't all that
> expensive. I usually set mine for a minute and forget about it.
>
> Transaction logs hold a window, _not_ the entire set of operations since
> time began. When you do a hard commit, the current tlog is closed, a new
> one is opened, and ones that are "too old" are deleted. If you never
> commit, you have a huge transaction log to no good purpose.
>
> Also, while indexing, in order to accommodate "Real Time Get", all the
> docs indexed since the last searcher was opened have a pointer kept in
> memory. So if you _never_ open a new searcher, that internal structure can
> get quite large. So in bulk-indexing operations, I suggest you open a
> searcher every so often.
>
> Opening a new searcher isn't terribly expensive if you have no autowarming
> going on. Autowarming is configured in solrconfig.xml on the filterCache,
> queryResultCache, etc.
>
> So if I were going to recommend settings, they'd be something like this:
> Do a hard commit with openSearcher=false every 60 seconds.
> Do a soft commit every 5 minutes.
>
> I'd actually be surprised if you were able to measure a difference between
> those settings and just a hard commit with openSearcher=true every 60
> seconds with soft commit at -1 (never)…
>
> Best,
> Erick
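To make the batching advice concrete, here is a minimal SolrJ sketch
(collection URL, batch size, and field names are made up for illustration).
It sends documents in batches and never calls commit(), leaving durability
and visibility to the autoCommit/autoSoftCommit settings shown earlier:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class BatchIndexer {
      public static void main(String[] args) throws Exception {
        // Hypothetical single-shard collection; ~100 docs/batch per the
        // numbers in Erick's reply above.
        try (SolrClient client = new HttpSolrClient.Builder(
                 "http://localhost:8983/solr/mycollection").build()) {
          List<SolrInputDocument> batch = new ArrayList<>();
          for (int i = 0; i < 10_000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Integer.toString(i));
            doc.addField("title_txt", "document " + i);
            batch.add(doc);
            if (batch.size() == 100) {   // send a full batch, not one doc at a time
              client.add(batch);         // no commit here
              batch.clear();
            }
          }
          if (!batch.isEmpty()) {
            client.add(batch);           // flush the final partial batch
          }
          // No explicit commit: the server-side auto commits handle it.
        }
      }
    }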
>
>> On Jun 2, 2019, at 3:35 PM, John Davis <johndavis925...@gmail.com> wrote:
>>
>> If we assume there is no query load, then effectively this boils down to
>> the most effective way of adding a large number of documents to the Solr
>> index. I've looked through SolrJ, DIH and others -- is the bottom line
>> across all of them to "batch updates" and not commit as long as possible?
>>
>> On Sun, Jun 2, 2019 at 7:44 AM Erick Erickson <erickerick...@gmail.com>
>> wrote:
>>
>>> Oh, there are about a zillion reasons ;).
>>>
>>> First of all, most tools that show heap usage also count uncollected
>>> garbage, so your 10G could actually be much less "live" data. A quick
>>> way to test is to attach jconsole to the running Solr and hit the button
>>> that forces a full GC.
>>>
>>> Another way is to reduce your heap when you start Solr (on a test system,
>>> of course) until bad stuff happens. If you reduce it to very close to
>>> what Solr needs, you'll get slower as more and more cycles are spent on
>>> GC; if you reduce it a little more, you'll get OOMs.
>>>
>>> You can take heap dumps, of course, to see where all the memory is being
>>> used, but that's tricky as it also includes garbage.
>>>
>>> I've seen cache sizes (filterCache in particular) use lots of memory,
>>> but that requires queries to be fired. Each filterCache entry can take
>>> up to roughly maxDoc/8 bytes + overhead…
>>>
>>> A classic error is to sort, group or facet on a docValues=false field.
>>> Starting with Solr 7.6, you can add an option to fields to throw an
>>> error if you do this, see:
>>> https://issues.apache.org/jira/browse/SOLR-12962.
>>>
>>> In short, there's not enough information to tell until you dive in and
>>> test a bunch of stuff.
>>>
>>> Best,
>>> Erick
>>>
>>>> On Jun 2, 2019, at 2:22 AM, John Davis <johndavis925...@gmail.com> wrote:
>>>>
>>>> This makes sense. Any ideas why Lucene/Solr would use 10G of heap for a
>>>> 20G index? My hypothesis was that merging segments was trying to read
>>>> it all, but if that's not the case I am out of ideas. The one caveat is
>>>> that we are trying to add the documents quickly (~1G an hour), but if
>>>> Lucene does write 100M segments and does a streaming merge it shouldn't
>>>> matter?
>>>>
>>>> On Sat, Jun 1, 2019 at 9:24 AM Walter Underwood <wun...@wunderwood.org>
>>>> wrote:
>>>>
>>>>>> On May 31, 2019, at 11:27 PM, John Davis <johndavis925...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> 2. Merging segments - does solr load the entire segment in memory or
>>>>>> chunks of it? if the latter, how large are these chunks?
>>>>>
>>>>> No, it does not read the entire segment into memory.
>>>>>
>>>>> A fundamental part of the Lucene design is streaming posting lists into
>>>>> memory and processing them sequentially. The same amount of memory is
>>>>> needed for small or large segments. Each posting list is in document-id
>>>>> order. The merge is a merge of sorted lists, writing a new posting list
>>>>> in document-id order.
>>>>>
>>>>> wunder
>>>>> Walter Underwood
>>>>> wun...@wunderwood.org
>>>>> http://observer.wunderwood.org/ (my blog)
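To put a rough number on the filterCache arithmetic in Erick's reply above
(both the index size and the cache size below are assumed purely for
illustration):

    maxDoc              = 20,000,000 docs           (assumed index size)
    bytes per entry     ≈ maxDoc / 8 = 2,500,000 B ≈ 2.5 MB
    filterCache size    = 512 entries               (assumed cache size)
    worst-case heap use ≈ 512 * 2.5 MB ≈ 1.3 GB, before any other overhead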