> Is there a problem with multi-valued fields and distributed queries?

> No. But there are some components that don't do the right thing in
> distributed mode, joins for instance. The list is actually quite small and
> is getting smaller all the time.

Yes, joins are the main one. There used to be some distributed-mode
constraints on grouping, but those might date from the 3.x days of field
collapsing.

> Sounds like you're doing something similar to us. In some cases we have a
> hard commit every minute. Keeping the caches hot seems like a very good
> reason to send data to a specific shard. At least I'm assuming that when
> you
> add documents to a single shard and commit; the other shards won't be
> impacted...

> Not true if the other shards have had any indexing activity. The commit is
> usually forwarded to all shards. If the individual index on a
> particular shard is
> unchanged then it should be a no-op though.

This is an excellent point, and well worth taking some care over.
We do it by indexing to a number of shards, and only committing to those
that actually have something to commit. Although an empty commit might be a
no-op on the indexing side, it's not on the autowarming/faceting side, so
care needs to be taken that you don't hose your caches unnecessarily.
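As a rough sketch of the bookkeeping involved (this is illustrative, not our
actual code, and the shard names/URLs are made up): track which shards have
received documents since the last commit, and only issue commits to those.

```python
# Hypothetical sketch: remember which shards received updates since the
# last commit, and only send a commit to those shards. Shard names and
# URLs below are placeholders.
class ShardCommitTracker:
    def __init__(self, shard_urls):
        self.shard_urls = shard_urls
        self.dirty = set()  # shards with uncommitted documents

    def record_update(self, shard):
        # Call this whenever a document is sent to a shard.
        self.dirty.add(shard)

    def shards_to_commit(self):
        # Only these shards need a commit; the rest keep their warm caches.
        pending, self.dirty = self.dirty, set()
        return [self.shard_urls[s] for s in sorted(pending)]

tracker = ShardCommitTracker({
    "shard1": "http://solr1:8983/solr/core1",
    "shard2": "http://solr2:8983/solr/core2",
})
tracker.record_update("shard2")
# A commit (e.g. an update?commit=true request) would then be issued only
# against the URLs returned here, leaving shard1's caches untouched:
urls = tracker.shards_to_commit()
```

The point is simply that the commit decision is made per shard rather than
broadcast, so untouched shards never pay the autowarming cost.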


On Wed, Jan 7, 2015 at 4:42 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> See below:
>
>
> On Wed, Jan 7, 2015 at 1:25 AM, Bram Van Dam <bram.van...@intix.eu> wrote:
> > On 01/06/2015 07:54 PM, Erick Erickson wrote:
> >>
> >> Have you considered pre-supposing SolrCloud and using the SPLITSHARD
> >> API command?
> >
> >
> > I think that's the direction we'll probably be going. Index size (at
> > least for us) can be unpredictable in some cases. Some clients start out
> > small and then grow exponentially, while others start big and then don't
> > grow much at all. Starting with SolrCloud would at least give us that
> > flexibility.
> >
> > That being said, SPLITSHARD doesn't seem ideal. If a shard reaches a
> > certain size, it would be better for us to simply add an extra shard,
> > without splitting.
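For reference, SPLITSHARD is a Collections API call. A sketch of building
the request (host, collection, and shard names here are placeholders):

```python
# Sketch of building a Collections API SPLITSHARD request URL.
# Host, collection, and shard names are placeholders.
from urllib.parse import urlencode

def splitshard_url(base, collection, shard):
    params = urlencode({
        "action": "SPLITSHARD",
        "collection": collection,
        "shard": shard,
    })
    return f"{base}/admin/collections?{params}"

url = splitshard_url("http://localhost:8983/solr", "mycollection", "shard1")
# The request is then issued with any HTTP client; splitting a shard
# produces two sub-shards covering the original shard's hash range.
```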
> >
>
> True, and you can do this if you take explicit control of the document
> routing, but...
> that's quite tricky. You forever after have to send any _updates_ to the
> same
> shard you did the first time, whereas SPLITSHARD will "do the right thing".
>
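As an aside on routing: one common form of routing control in Solr is the
compositeId router, where a prefix in the document ID (e.g. `tenant!docid`)
determines shard placement, and every later update to a document must reuse
the same prefix. (Explicit control can also mean the implicit router, where
the client targets shards directly.) A hypothetical sketch of the prefix
behaviour:

```python
# Hypothetical sketch: with Solr's compositeId router, the part of the id
# before '!' is hashed to pick a shard, so docs sharing a prefix co-locate,
# and updates must keep the same prefix or they land elsewhere.
def routing_prefix(doc_id):
    # 'clientA!doc42' -> 'clientA'; ids without '!' hash on the whole id
    return doc_id.split("!", 1)[0] if "!" in doc_id else doc_id

docs = ["clientA!doc1", "clientA!doc2", "clientB!doc9"]
groups = {}
for d in docs:
    groups.setdefault(routing_prefix(d), []).append(d)
# clientA's docs land together; re-sending 'clientA!doc1' later under a
# different prefix would create a duplicate on another shard.
```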
> >
> >> On Tue, Jan 6, 2015 at 10:33 AM, Peter Sturge <peter.stu...@gmail.com>
> >> wrote:
> >>>
> >>> ++1 for the automagic shard creator. We've been looking into doing this
> >>> sort of thing internally - i.e. when a shard reaches a certain size/num
> >>> docs, it creates 'sub-shards' to which new commits are sent and queries
> >>> to the 'parent' shard are included. The concept works, as long as you
> >>> don't try any non-dist stuff - it's one reason why all our fields are
> >>> always single valued.
> >
> >
> > Is there a problem with multi-valued fields and distributed queries?
>
> No. But there are some components that don't do the right thing in
> distributed mode, joins for instance. The list is actually quite small and
> is getting smaller all the time.
>
> >
> >>> A cool side-effect of sub-sharding (for lack of a snappy term) is that
> >>> the
> >>> parent shard then stops suffering from auto-warming latency due to
> >>> commits
> >>> (we do a fair amount of committing). In theory, you could carry on
> >>> sub-sharding until your hardware starts gasping for air.
> >
> >
> > Sounds like you're doing something similar to us. In some cases we have a
> > hard commit every minute. Keeping the caches hot seems like a very good
> > reason to send data to a specific shard. At least I'm assuming that when
> > you
> > add documents to a single shard and commit; the other shards won't be
> > impacted...
>
> Not true if the other shards have had any indexing activity. The commit is
> usually forwarded to all shards. If the individual index on a particular
> shard is unchanged then it should be a no-op though.
>
> But the usage pattern here is its own bit of a trap. If all your indexing
> is going to a single shard, then the entire indexing _load_ is also
> happening on that shard, so the CPU utilization will be higher there than
> on the older shards. Since distributed requests need a response from every
> shard before returning to the client, the response time is bounded by the
> slowest shard, and this may actually be slower overall. Probably only
> noticeable when the CPU is maxed anyway though.
>
>
>
> >
> >  - Bram
> >
>