Erick - these two recommendations conflict; what's changed?

So if I were going to recommend settings, they’d be something like this:
Do a hard commit with openSearcher=false every 60 seconds.
Do a soft commit every 5 minutes.

vs

Index-heavy, Query-light
Set your soft commit interval quite long, up to the maximum latency you can
stand for documents to be visible. This could be just a couple of minutes
or much longer. Maybe even hours with the capability of issuing a hard
commit (openSearcher=true) or soft commit on demand.
https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
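
For reference, "on demand" here just means an explicit commit issued from the
client. A SolrJ sketch (URL and collection name are made up for illustration):

  import org.apache.solr.client.solrj.SolrClient;
  import org.apache.solr.client.solrj.impl.HttpSolrClient;

  SolrClient client =
      new HttpSolrClient.Builder("http://localhost:8983/solr").build();
  // commit(collection, waitFlush, waitSearcher, softCommit)
  client.commit("mycollection", true, true, true);   // soft commit on demand
  client.commit("mycollection", true, true, false);  // hard commit, opens a searcher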

On Sun, Jun 2, 2019 at 8:58 PM Erick Erickson <erickerick...@gmail.com>
wrote:

> > I've looked through SolrJ, DIH and others -- is the bottom line
> > across all of them to "batch updates" and not commit as long as possible?
>
> Of course it’s more complicated than that ;)….
>
> But to start, yes, I urge you to batch. Here’s some stats:
> https://lucidworks.com/2015/10/05/really-batch-updates-solr-2/
>
> Note that at about 100 docs/batch you hit diminishing returns. _However_,
> that test was run on a single shard collection, so if you have 10 shards
> you’d
> have to send 1,000 docs/batch. I wouldn’t sweat that number much, just
> don’t
> send one at a time. And there are the usual gotchas if your documents are
> 1M vs. 1K.
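>
> A minimal SolrJ sketch of that batching (URL, collection and field names
> are made up; the loop stands in for your real doc source):
>
>   import java.util.ArrayList;
>   import java.util.List;
>   import org.apache.solr.client.solrj.SolrClient;
>   import org.apache.solr.client.solrj.impl.HttpSolrClient;
>   import org.apache.solr.common.SolrInputDocument;
>
>   SolrClient client = new HttpSolrClient.Builder(
>       "http://localhost:8983/solr/mycollection").build();
>   List<SolrInputDocument> batch = new ArrayList<>();
>   for (int i = 0; i < 10_000; i++) {     // stand-in for your real doc source
>     SolrInputDocument doc = new SolrInputDocument();
>     doc.addField("id", Integer.toString(i));
>     doc.addField("title_s", "doc " + i);
>     batch.add(doc);
>     if (batch.size() >= 100) {           // ~100 docs/batch per the stats above
>       client.add(batch);                 // one request for the whole batch
>       batch.clear();
>     }
>   }
>   if (!batch.isEmpty()) client.add(batch);
>   client.close();   // no explicit commit; let autoCommit handle it (below)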
>
> About committing. No, don’t hold off as long as possible. When you commit,
> segments are written and may be merged. _However_, the default 100M internal
> buffer size means that segments are written anyway once you accumulate 100M
> of index data, even if you never hit a commit point, and merges happen
> anyway. So you won’t save anything on merging by holding off commits.
> And you’ll incur penalties. Here’s more than you want to know about
> commits:
>
> https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
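>
> (The 100M buffer mentioned above is the ramBufferSizeMB setting; a sketch
> of where it lives in solrconfig.xml, with the default value:)
>
>   <indexConfig>
>     <!-- flush a new segment once this much index data accumulates,
>          whether or not a commit happens; 100 MB is the default -->
>     <ramBufferSizeMB>100</ramBufferSizeMB>
>   </indexConfig>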
>
> But some key take-aways… If for some reason Solr abnormally
> terminates, the accumulated documents since the last hard
> commit are replayed. So say you don’t commit for an hour of
> furious indexing and someone does a “kill -9”. When you restart
> Solr it’ll try to re-index all the docs for the last hour. Hard commits
> with openSearcher=false aren’t all that expensive. I usually set mine
> for a minute and forget about it.
>
> Transaction logs hold a window, _not_ the entire set of operations
> since time began. When you do a hard commit, the current tlog is
> closed, a new one is opened, and any that are “too old” are deleted. If
> you never commit, you have a huge transaction log to no good purpose.
>
> Also, while indexing, in order to accommodate “Real Time Get”, all
> the docs indexed since the last searcher was opened have a pointer
> kept in memory. So if you _never_ open a new searcher, that internal
> structure can get quite large. So in bulk-indexing operations, I
> suggest you open a searcher every so often.
>
> Opening a new searcher isn’t terribly expensive if you have no autowarming
> going on. Autowarming is configured in solrconfig.xml on the filterCache,
> queryResultCache, etc.
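>
> A sketch of what that looks like in solrconfig.xml (cache class and sizes
> are illustrative, not a recommendation):
>
>   <filterCache class="solr.FastLRUCache"
>                size="512"
>                initialSize="512"
>                autowarmCount="0"/>  <!-- 0 = no autowarming on new searchers -->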
>
> So if I were going to recommend settings, they’d be something like this:
> Do a hard commit with openSearcher=false every 60 seconds.
> Do a soft commit every 5 minutes.
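>
> A solrconfig.xml sketch of those settings (maxTime is in milliseconds):
>
>   <autoCommit>
>     <maxTime>60000</maxTime>            <!-- hard commit every 60 seconds -->
>     <openSearcher>false</openSearcher>
>   </autoCommit>
>   <autoSoftCommit>
>     <maxTime>300000</maxTime>           <!-- soft commit every 5 minutes -->
>   </autoSoftCommit>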
>
> I’d actually be surprised if you were able to measure any difference between
> those settings and just a hard commit with openSearcher=true every 60
> seconds and a soft commit maxTime of -1 (never)…
>
> Best,
> Erick
>
> On Jun 2, 2019, at 3:35 PM, John Davis <johndavis925...@gmail.com>
> wrote:
> >
> > If we assume there is no query load, then effectively this boils down to
> > the most effective way of adding a large number of documents to the solr
> > index. I've looked through SolrJ, DIH and others -- is the bottom line
> > across all of them to "batch updates" and not commit as long as possible?
> >
> > On Sun, Jun 2, 2019 at 7:44 AM Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >
> >> Oh, there are about a zillion reasons ;).
> >>
> >> First of all, most tools that show heap usage also count uncollected
> >> garbage. So your 10G could actually be much less “live” data. A quick way
> >> to test is to attach jconsole to the running Solr and hit the button that
> >> forces a full GC.
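> >>
> >> (Or from a shell, assuming a JDK install and you know Solr's pid:
> >> "jcmd <pid> GC.run" forces a full GC without attaching a GUI.)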
> >>
> >> Another way is to reduce your heap when you start Solr (on a test system,
> >> of course) until bad stuff happens. If you reduce it to very close to what
> >> Solr needs, you’ll get slower as more and more cycles are spent on GC; if
> >> you reduce it a little more, you’ll get OOMs.
> >>
> >> You can take heap dumps of course to see where all the memory is being
> >> used, but that’s tricky as it also includes garbage.
> >>
> >> I’ve seen cache sizes (filterCache in particular) be something that uses
> >> lots of memory, but that requires queries to be fired. Each filterCache
> >> entry can take up to roughly maxDoc/8 bytes + overhead….
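> >>
> >> A worked example with a made-up index size: at maxDoc = 100,000,000 a
> >> single entry is 100,000,000 / 8 = 12.5 MB, so a filterCache with
> >> size=512 can grow to roughly 512 * 12.5 MB, about 6.4 GB, all by itself.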
> >>
> >> A classic error is to sort, group or facet on a docValues=false field.
> >> Starting with Solr 7.6, you can add an option to fields to throw an error
> >> if you do this, see: https://issues.apache.org/jira/browse/SOLR-12962.
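> >>
> >> If I remember right, that option is uninvertible; a schema sketch (field
> >> name made up):
> >>
> >>   <field name="title_s" type="string" indexed="true" stored="true"
> >>          docValues="false" uninvertible="false"/>
> >>
> >> With uninvertible="false", sorting/grouping/faceting on the field throws
> >> an error instead of silently uninverting it onto the heap.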
> >>
> >> In short, there’s not enough information to tell until you dive in and
> >> test bunches of stuff.
> >>
> >> Best,
> >> Erick
> >>
> >>
> >> On Jun 2, 2019, at 2:22 AM, John Davis <johndavis925...@gmail.com>
> >> wrote:
> >>>
> >>> This makes sense. Any ideas why lucene/solr will use 10g heap for a 20g
> >>> index? My hypothesis was that merging segments was trying to read it all,
> >>> but if that's not the case I am out of ideas. The one caveat is we are
> >>> trying to add the documents quickly (~1g an hour), but if lucene does
> >>> write 100m segments and does a streaming merge, it shouldn't matter?
> >>>
> >>> On Sat, Jun 1, 2019 at 9:24 AM Walter Underwood <wun...@wunderwood.org>
> >>> wrote:
> >>>
> >>>> On May 31, 2019, at 11:27 PM, John Davis <johndavis925...@gmail.com>
> >>>> wrote:
> >>>>>
> >>>>> 2. Merging segments - does solr load the entire segment in memory or
> >>>>> chunks of it? If the latter, how large are these chunks?
> >>>>
> >>>> No, it does not read the entire segment into memory.
> >>>>
> >>>> A fundamental part of the Lucene design is streaming posting lists
> >>>> into memory and processing them sequentially. The same amount of memory
> >>>> is needed for small or large segments. Each posting list is in
> >>>> document-id order. The merge is a merge of sorted lists, writing a new
> >>>> posting list in document-id order.
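> >>>>
> >>>> A toy sketch of that idea in Java -- merging k sorted doc-id lists into
> >>>> one sorted stream while holding only one cursor per list, nothing
> >>>> Lucene-specific:
> >>>>
> >>>>   import java.util.PriorityQueue;
> >>>>
> >>>>   class PostingMerge {
> >>>>     /** Merge k sorted doc-id lists, holding one cursor per list. */
> >>>>     static int[] merge(int[][] lists) {
> >>>>       // entries are {docId, listIndex, offsetInList}, smallest docId first
> >>>>       PriorityQueue<int[]> pq =
> >>>>           new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
> >>>>       int total = 0;
> >>>>       for (int i = 0; i < lists.length; i++) {
> >>>>         total += lists[i].length;
> >>>>         if (lists[i].length > 0) pq.add(new int[] {lists[i][0], i, 0});
> >>>>       }
> >>>>       int[] out = new int[total];
> >>>>       for (int n = 0; n < total; n++) {
> >>>>         int[] top = pq.poll();           // smallest pending doc id
> >>>>         out[n] = top[0];
> >>>>         int list = top[1], next = top[2] + 1;
> >>>>         if (next < lists[list].length)   // advance that list's cursor
> >>>>           pq.add(new int[] {lists[list][next], list, next});
> >>>>       }
> >>>>       return out;
> >>>>     }
> >>>>   }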
> >>>>
> >>>> wunder
> >>>> Walter Underwood
> >>>> wun...@wunderwood.org
> >>>> http://observer.wunderwood.org/  (my blog)
> >>>>
> >>>>
> >>
> >>
>
>
