Hi Marius, if I have understood correctly, you issue a deleteByQuery for each
document, is that right?

On Thu, 16 Jun 2022 at 11:04, Marius Grigaitis
<[email protected]> wrote:

> Just a follow-up on the topic.
>
> * We checked the settings on Solr; they seem quite default (especially the
> merge and commit strategies, etc.)
> * We commit every 10 minutes
> * Added NewRelic to the Solr instance to gather more data and graphs
>
> In the end, what caught our eye was a few deleteByQuery calls in the stack
> traces of running threads while Solr was overloaded. We temporarily removed
> deleteByQuery and saw roughly a 10x improvement in indexing speed.
>
> How are we using deleteByQuery?
>
> update(add=[{uid: foo-123, sku: 123, ...}, {uid: bar-124, sku: 124} ...],
> deleteByQuery=["sku: 123 AND uid != foo-123", "sku: 124 AND uid !=
> bar-124"])
>
> UID is the uniqueKey for the index. We do this because "foo" or "bar" could
> change and we no longer want the previous document present.
>
> Ideally we should probably change our uniqueKey to `sku` in this case, and
> then we would no longer need deleteByQuery. Still, it would be interesting
> to know why deleteByQuery causes such a performance bottleneck, and how we
> could optimize it if we wanted to keep it.
>
> Marius
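A likely explanation for the slowdown: a deleteByQuery has to be ordered against all in-flight adds, which can block searcher reopens and merges in a way that delete-by-id does not. The sketch below is an illustration (not the thread's actual code) of restructuring the update to use Solr's JSON delete-by-id form; `stale_uids` is a hypothetical stand-in for the result of a prior lookup query, and the `uid`/`sku` field names come from the thread.

```python
# Minimal sketch: replace deleteByQuery with delete-by-id. Assumes the uids
# of stale documents are looked up first (e.g. a /select query on sku that
# excludes the new uid); `uid`/`sku` are the field names from the thread.

def add_body(docs):
    """Body for POST /update: a JSON array of docs, each overwriting any
    existing doc with the same uniqueKey (uid)."""
    return docs

def delete_by_id_body(stale_uids):
    """Body for POST /update: Solr's JSON delete-by-id form, which avoids
    the ordering/reopen cost that deleteByQuery incurs."""
    return {"delete": sorted(set(stale_uids))}

new_docs = [{"uid": "foo-123", "sku": 123}, {"uid": "bar-124", "sku": 124}]
print(add_body(new_docs))
print(delete_by_id_body(["old-a", "old-b", "old-a"]))
# → {'delete': ['old-a', 'old-b']}
```

The two bodies can be sent as separate POSTs to the core's /update handler, keeping adds and deletes out of a single mixed deleteByQuery request.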
>
> On Wed, Jun 8, 2022 at 8:41 PM David Hastings
> <[email protected]> wrote:
>
> > > * Do NOT commit after each batch of 1000 docs. Instead, commit as
> > > seldom as your requirements allow, e.g. try commitWithin=60000 to
> > > commit every minute
> >
> > This is the big one. Commit after the entire process is done, or on a
> > timer if you don't need NRT searching; rarely does anyone actually need
> > that. A commit is a heavy operation and takes about the same time whether
> > you are committing 1000 documents or 100k.
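The commit strategy above can be sketched as request parameters: drop the per-batch commit and set a commitWithin ceiling instead. This is a hedged sketch assuming Solr's standard /update request parameters; the one-minute value is the example from the thread.

```python
# Sketch of the commit strategy above: no commit=true per batch, just a
# commitWithin ceiling so Solr folds many batches into a single commit.
# Parameter names follow Solr's standard /update request API.

def update_params(commit_within_ms=60000, explicit_commit=False):
    """Query params for POST <core>/update. With explicit_commit=False,
    Solr commits within commit_within_ms milliseconds of the update."""
    params = {"commitWithin": commit_within_ms}
    if explicit_commit:  # only for one final commit when indexing finishes
        params["commit"] = "true"
    return params

print(update_params())
# → {'commitWithin': 60000}
```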
> >
> > On Wed, Jun 8, 2022 at 10:40 AM Jan Høydahl <[email protected]> wrote:
> >
> > > * Go multi-threaded for each core, as Shawn says. Try e.g. 2, 3 and 4
> > > threads
> > > * Experiment with different batch sizes, e.g. try 500 and 2000 - which
> > > size is optimal depends on your docs
> > > * Do NOT commit after each batch of 1000 docs. Instead, commit as
> > > seldom as your requirements allow, e.g. try commitWithin=60000 to
> > > commit every minute
> > >
> > > Tip: Try to push Solr metrics to DataDog or some other service, where
> > > you can see a dashboard with stats on requests/sec, RAM, CPU, threads,
> > > GC etc., which may answer your last question.
> > >
> > > Jan
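The batch-size experiments suggested above (500, 1000, 2000) only need a small chunking helper so each POST to /update carries exactly one batch. A minimal, Solr-agnostic sketch:

```python
# Helper for the batch-size experiments suggested above: slice the document
# stream into batches of a configurable size, one batch per POST.

def batches(docs, size):
    """Yield successive lists of at most `size` documents."""
    for i in range(0, len(docs), size):
        yield docs[i:i + size]

sizes = [len(b) for b in batches(list(range(2500)), 1000)]
print(sizes)
# → [1000, 1000, 500]
```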
> > >
> > > > On 8 Jun 2022, at 14:06, Shawn Heisey <[email protected]> wrote:
> > > >
> > > > On 6/8/2022 3:35 AM, Marius Grigaitis wrote:
> > > >> * 9 different cores. Each weighs around 100 MB on disk and has
> > > >> approximately 90k documents.
> > > >> * Updating is performed using the update method in batches of 1000,
> > > >> around 9 processes in parallel (split by core)
> > > >
> > > > This means that indexing within each Solr core is single-threaded.
> > > > The way to increase indexing speed is to index in parallel with
> > > > multiple threads or processes per index. If you can increase the CPU
> > > > power available on the Solr server when you increase the number of
> > > > processes/threads sending data to Solr, that might help.
> > > >
> > > > Thanks,
> > > > Shawn
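Shawn's suggestion of several sender threads per core can be sketched with a thread pool. Here `post_batch` is a hypothetical stand-in for the HTTP POST to <core>/update; it only counts documents so the sketch runs without a Solr server.

```python
# Sketch: several sender threads per core instead of one. post_batch is a
# hypothetical stand-in for POSTing a batch to <core>/update; it just
# counts documents so the example is runnable standalone.

from concurrent.futures import ThreadPoolExecutor

def post_batch(core, batch):
    # Real code would POST `batch` to http://host:8983/solr/<core>/update
    return len(batch)

def index_core(core, all_batches, threads=4):
    """Send this core's batches from `threads` parallel workers and
    return the total number of documents sent."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return sum(pool.map(lambda b: post_batch(core, b), all_batches))

print(index_core("core1", [[1, 2, 3], [4, 5]], threads=2))
# → 5
```

With 9 cores, one such pool per core keeps the per-core streams independent while multiplying the senders hitting each index.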
> > > >
> > >
> > >
> >
>
-- 
Vincenzo D'Amore
