Yet that is a task which the main application, Solr, could and
should undertake, rather than ask us human slaves to add sundry programs
to tend it from afar.
Similarly, it would be useful for there to be feedback from Solr
when adding material so that we don't overwhelm parts of the pipeline.
That's a classical problem with known solutions.
Thanks,
Joe D.
On 21/04/2018 19:16, Erick Erickson wrote:
Yeah, trying to have something that satisfies all use cases is a bear.
I know of one installation where the indexing rate was so huge (80B
docs/day) that they couldn't afford to have any merging, so in that
situation any heuristics built into Solr would be wrong.
Here's an alternate approach to having buttons where you have to
attend to it each day:
http://localhost:8983/solr/admin/cores?action=STATUS
returns each core and the number of docs, maxdocs, and deleted docs.
One could set up a cron job that runs every night at 3:00 am that then
sends the optimize command to any core with greater than X% deleted
docs, where X is your locally-determined threshold. That would be less
work actually than having to attend to it every day.
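The nightly job described above could be sketched roughly as follows.
This is a minimal sketch, not a tested production script. It assumes
the STATUS response is requested as JSON and carries "maxDoc" and
"deletedDocs" under each core's "index" entry, and that the optimize
command can be sent as a plain GET to the update handler. The 20%
threshold is a hypothetical stand-in for X.

```python
import json
import urllib.parse
import urllib.request

# Host and port as in the STATUS URL above; adjust for your installation.
BASE_URL = "http://localhost:8983/solr"
# Hypothetical value for X, the locally-determined threshold.
THRESHOLD_PCT = 20.0


def deleted_pct(index_info):
    """Deleted docs as a percentage of maxDoc; 0.0 for an empty core."""
    max_doc = index_info.get("maxDoc", 0)
    if max_doc <= 0:
        return 0.0
    return 100.0 * index_info.get("deletedDocs", 0) / max_doc


def cores_over_threshold(status_response, threshold_pct):
    """Sorted names of cores whose deleted percentage exceeds the threshold."""
    cores = status_response.get("status", {})
    return sorted(
        name
        for name, info in cores.items()
        if deleted_pct(info.get("index", {})) > threshold_pct
    )


def fetch_status(base_url=BASE_URL):
    """Fetch the core STATUS report (assumed to be JSON) as a dict."""
    url = base_url + "/admin/cores?action=STATUS&wt=json"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def optimize(core, base_url=BASE_URL):
    """Send the optimize command to one core and return the HTTP status."""
    url = "{}/{}/update?optimize=true".format(base_url, urllib.parse.quote(core))
    with urllib.request.urlopen(url) as resp:
        return resp.status


def run(base_url=BASE_URL, threshold_pct=THRESHOLD_PCT):
    """Optimize every core that is over the threshold; return their names."""
    selected = cores_over_threshold(fetch_status(base_url), threshold_pct)
    for core in selected:
        optimize(core, base_url)
    return selected
```

Calling run() from a cron entry scheduled as "0 3 * * *" would give
the 3:00 am behaviour. The selection logic is kept separate from the
network calls so the threshold can be checked against a saved STATUS
response before anything is optimized.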
FWIW
On Sat, Apr 21, 2018 at 10:55 AM, Joe Doupnik <j...@netlab1.net> wrote:
A good find Erick, and one which brings into focus the real problem at
hand. That overload case would happen if there were an Optimise button or if
the curl equivalent command were issued, and is not a reason to avoid
either/both.
So, what could be done to avoid such awkward difficulties?
Well, an obvious suggestion, without knowing the details: might the
system be able to estimate internal conditions sufficiently to issue a
warning and decline an Optimise? Certainly average system managers are not
about to decode and monitor Java VM nuances.
Discussion about automating removals based on sizes of this and that
seems, from this distance, to be musings yet to face the real world. In the
meanwhile we need to control matters, hence the button request.
The resource consumption issue is inherent in such systems, and we in
the field have very little information to help make choices. I know, it's
not a simple affair, and too many buzz words fly about. Thus the engineers
close to the code might have a ponder about the above predictive capability
and about the overall resource consumption process which might permit the
system to adapt to progressively larger loads over time.
In my own situation I feed material into Solr a file at a time, give a
small pause, repeat, get to 100 entries and wait a bit longer, and so on
every file, hundred files, thousand files. This works well to reduce
resource peaks and uncompleted operations, and it lets the system run in the
background all day if necessary without disturbing main activities. My
longest run was over a full day, 660+K documents which worked just fine and
did not upset other activities in the machine.
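For anyone wanting to try the same pacing, the scheme above might be
sketched like this. It is only an illustration: the pause lengths are
hypothetical (the actual values are not given here), and send() stands
for whatever posts one file to Solr at your site.

```python
import time


def pause_after(count, short=0.5, medium=5.0, long=30.0):
    """Seconds to wait after the count-th file (1-based).

    Every file gets a short pause, every 100th a longer one and
    every 1000th the longest. All three durations are made-up values.
    """
    if count % 1000 == 0:
        return long
    if count % 100 == 0:
        return medium
    return short


def feed(paths, send, sleep=time.sleep):
    """Send each file with send(path), pausing after each one.

    Returns the number of files sent. sleep is injectable so the
    pacing can be checked without actually waiting.
    """
    sent = 0
    for path in paths:
        send(path)
        sent += 1
        sleep(pause_after(sent))
    return sent
```

The fixed pauses are a blind throttle. They cap the feed rate without
any feedback from Solr, which is the limitation raised at the top of
this thread.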
Thanks,
Joe D.
On 21/04/2018 17:54, Erick Erickson wrote:
Joe:
Serendipity strikes. The thread titled "JVM Heap Memory Increase (SOLR
CLOUD)" is a perfect example of why the optimize button is so
"fraught".
Best,
Erick
On Sat, Apr 21, 2018 at 9:43 AM, Erick Erickson <erickerick...@gmail.com> wrote:
Joe:
Thanks for moving the conversation over here that we were having on
the blog post. I think the wider audience will benefit from this going
forward.
bq: ...apparent inability to remove piles of deleted docs
Do note that deleted docs are removed during normal indexing when
segments are merged, they're not permanently retained in the index.
Part of the thinking behind SOLR-7733 is exactly that once you press
the very tempting optimize button, you can get into a situation where
your one huge segment does _not_ have the deleted docs removed until
the "live" document space is < 2.5G. Thus if you have a 100G segment
after optimize, it'll look like deleted docs are never removed until
at least 97.5% of the docs are deleted. The default max segment size
is 5G, and the current algorithm doesn't consider segments eligible
for merging until 50% of that maximum number consists of "live" docs.
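The arithmetic behind the 97.5% figure can be checked with a few lines.
This is a rough sketch of the rule as described above, not the actual
TieredMergePolicy code, and it assumes deleted docs take up space in
proportion to their count.

```python
def deleted_pct_needed(segment_gb, max_segment_gb=5.0, live_fraction=0.5):
    """Percent of a segment that must be deleted before it is merge-eligible.

    Under the rule described above, a segment is eligible once its
    live data drops below live_fraction * max_segment_gb (2.5G with
    the defaults quoted in this thread).
    """
    live_limit_gb = max_segment_gb * live_fraction
    if segment_gb <= live_limit_gb:
        return 0.0  # already under the limit, no deletes required
    return 100.0 * (segment_gb - live_limit_gb) / segment_gb


# A 100G segment left behind by optimize:
print(deleted_pct_needed(100.0))
# A segment at the default 5G maximum:
print(deleted_pct_needed(5.0))
```

The first call gives 97.5 and the second 50.0, which matches both the
100G example and the roughly 50% cap on deleted docs in normally
merged indexes.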
The optimize functionality in the admin UI was removed as part of
SOLR-7733 from the screen that comes up when you select a core, but
the "core admin" screen still has the optimize button that comes and
goes depending on whether there are any deleted documents or not. This
page is only visible in standalone mode.
Unfortunately SOLR-7733 removed the functionality that actually sent
the optimize command from the javascript, so pressing the optimize
button does nothing. This is indeed a bug, see: SOLR-12253 which will
remove the button from the core admin screen in stand-alone mode.
Optimize (aka forceMerge) is pretty actively discouraged because it is:
1> very expensive
2> has significant "gotchas" (we chatted in comments in the blog post
about the gotchas).
So we made a decision to make it more of an 'expert' option, requiring
users to issue a curl/Browser URL command like
"....solr/core_or_collection/update?optimize=true" if this
functionality is really desirable in their situation. Docs will be
updated too, they're lagging a bit.
Coming probably in Solr 7.4 is a new parameter (tentatively) for
TieredMergePolicy (TMP) that puts a soft ceiling on the percentage of
deleted docs in an index. The current version of this patch
(LUCENE-7976) sets this threshold at 20% at the expense of about 10%
more I/O in my tests from the current TMP implementation. Under
discussion is how low to allow this to be, we're thinking 10% as a
floor, and what the default should be. The current TMP caps the
percentage of deleted docs at close to 50%.
The thinking behind not allowing the percent deleted documents to be
too low is that that would trigger its own massive I/O issues,
rewriting "live" documents over and over and over. For NRT indexes,
that's almost certainly a horrible tradeoff. For more static indexes,
the "expert" API command is still available.
Best,
Erick
On Sat, Apr 21, 2018 at 5:08 AM, Joe Doupnik <j...@netlab1.net> wrote:
In Solr v7.3.0 the ability to remove "deleted" docs from a core by use
of what until then was the Optimise button on the admin GUI has been
changed in an ungood way. I refer to the v7.3.0 Changes list, item
SOLR-7733 (quote: remove "optimize" from the UI, end quote). The result
of that is an apparent inability to remove piles of deleted docs, which
amongst other things means wasting disk space. That is a marked step
backward and is unhelpful for use of Solr in the field. As other
comments in the now closed 7733 ticket explain, this is a user item
which has impact on their site, and it ought to be an inherent feature
of Solr. Consider a file system where complete deletes are forbidden,
or your kitchen where taking out the rubbish is denied. Hand waving
about obscure auto-sizing notions will not suffice. Thus may I urge
that the Optimise button and operation be returned to use, as it was
until Solr v7.3.0.
Thanks,
Joe D.