Re: Performance Implications of Different Routing Schemes

2018-03-05 Thread Stephen Lewis
Hi!

Thank you for your thoughtful response Shawn! I have a couple of follow up
questions and points to check my understanding on.

Thanks for explaining my misunderstanding on implicit routing. So to repeat
back and check my understanding: with implicit routing, distribution may
either be left up to Solr, or you may specify the router.field parameter at
collection creation time. If no router.field parameter is specified, Solr
will distribute documents to shards based solely on a hash of the docID; if
router.field is defined, Solr will distribute documents to shards based
solely on the hash of the router.field value on the doc. Correct?

So let me focus a bit more on the composite ID router. The options
available here are:

   - single prepend routing (tenant1!id1)
   - multi prepend routing (tenant1!piece1!id1)
   - num-bits prepend routing (tenant1/B!id1)

I think the first two are relatively straightforward: the ask is for the
application layer to supply one or two prepends, and then Solr will find an
appropriate shard to host the document based on a hash of the prepend(s).
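To check my mental model, here is a rough sketch of the single-prepend case. This uses a stand-in CRC32 hash rather than Solr's actual MurmurHash3, and simple equal-width shard ranges, so it is illustrative only:

```python
import zlib

def h32(s: str) -> int:
    """Stand-in 32-bit hash (Solr actually uses MurmurHash3)."""
    return zlib.crc32(s.encode("utf-8")) & 0xFFFFFFFF

def composite_hash(shard_key: str, doc_id: str) -> int:
    """Upper 16 bits from the shard key's hash, lower 16 from the doc id's."""
    return (h32(shard_key) & 0xFFFF0000) | (h32(doc_id) >> 16)

def shard_for(hash32: int, num_shards: int) -> int:
    """Map a 32-bit hash onto equal-width shard ranges."""
    width = (1 << 32) // num_shards
    return min(hash32 // width, num_shards - 1)

# Every doc with the same prepend shares the upper 16 hash bits,
# so a tenant's docs cluster into one narrow slice of the hash range.
assert composite_hash("tenant1", "id1") >> 16 == composite_hash("tenant1", "id2") >> 16
shard = shard_for(composite_hash("tenant1", "id1"), 4)
```

I assume the two-prepend case (tenant1!piece1!id1) works analogously, with the bits split further between the two prepends.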

I'm very interested, though, by the num-bits prepend. (By the way, I never
found an agreed-upon name for this, so let me know if there is something
standard I should call it.) Originally when I wrote I had a
misunderstanding here, but I believe I've understood it fully now. If B is
your "bits" param, then a tenant will be spread out over a 1/(2^B) fraction
of the shards: if B = 0, any shard may end up hosting the doc; if B = 1,
half of the shards may be the one to host the doc; if B = 2, one quarter of
the shards may be the one to host the doc; etc.

I was still a bit uncertain about the mechanism until I looked deeper into
the documentation and this article: "[The composite ID router] creates a
composite 32 bit hash by taking 16 bits from the shard key's hash and 16
bits from the document id's hash. When a tenant is too large to fit on a
single shard it can be spread across multiple shards by specifying the
number of bits to use from the shard key. The syntax for this is:
shard_key/num!document_id. The /num is the number of bits from the shard
key to use in the composite hash." So I think this means that first the
hashes of each are computed, and then the bits are taken from the resultant
hash values. Is that correct? I believe that means that if num = 16, that
would be the same as omitting the /num param. I believe it also implies
that if the number of shards is not a power of 2, some irregularity in the
number of shards per tenant will be experienced; e.g., with a collection of
11 shards and num = 2, 11/2^2 = 2.75, so every tenant will live on at least
3 shards, and some may end up living on 4 depending on exactly how the
ranges work out.
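To check that arithmetic, here is a small sketch that enumerates the tenant slices of the hash range and counts how many shards each slice overlaps. It assumes equal-width shard ranges, which real Solr shard ranges need not be, so it only illustrates the 2.75-shards-per-slice reasoning:

```python
def shards_touched(num_shards: int, bits: int) -> list[int]:
    """For each of the 2**bits tenant slices of the 32-bit hash range,
    count how many equal-width shards the slice overlaps."""
    total = 1 << 32
    slice_w = total >> bits
    counts = []
    for i in range(1 << bits):
        lo, hi = i * slice_w, (i + 1) * slice_w - 1
        first = lo * num_shards // total   # shard containing the slice start
        last = hi * num_shards // total    # shard containing the slice end
        counts.append(last - first + 1)
    return counts

# 11 shards, /2 bits: each tenant slice spans 11 / 2**2 = 2.75 shard
# widths, so every tenant touches at least 3 shards and some touch 4.
print(sorted(set(shards_touched(11, 2))))   # -> [3, 4]
```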

Some of what implicit routing offers could be quite desirable if the
document shard-routing ever needs to be updated. Specifically, the ID of
the document could remain constant even when making updates to the routing
key, which I believe would allow in-place updates to a document that
changes its host shard. So for example, if I create a collection using
implicit routing and router.field=shard_key, then I can insert a document
with id=1 and shard_key=1, later insert a document with id=1 and
shard_key=2, and the original document sent with shard_key=1 will be
automatically deleted on insert of the new document. Can you confirm
whether this is true? Or would the original document with id=1 and
shard_key=1 not necessarily be deleted?

The only drawback I see, then, of using implicit routing would be that
there is no equivalent to the num-bits prepend: if you want to spread the
documents associated with a given tenant across more than one but fewer
than all shards, it falls to the developer to update the shard key. Is
this correct, or is there an equivalent notion to num-bits for implicit
routing?

There's one other thing I want to drill down on a bit more with resource
usage:


> Query performance should be about the same for any routing type.  It does
> look like when you use compositeId and actually implement shard keys, you
> can limit queries to those shards, but a *general* query is going to hit
> all shards.

Currently I have experience targeting one shard at a time with queries. My
goal in this architecture is to be a little more flexible, and instead keep
the number of shards a given query has to hit approximately constant even
as the user base and the Solr cloud grow. I believe that will keep CPU
sprawl to a minimum; more on that below.

> If your query rate is very low (or shards are distributed across a lot
> of hardware that has significant spare CPU capacity) performance isn't
> going to be dramatically different for a query that hits 2 shards versus
> one that hits 6 shards.  If your query rate is high, then more shards
> probably will be noticeably slower than fewer shards.

Re: Performance Implications of Different Routing Schemes

2018-03-02 Thread Shawn Heisey
On 3/2/2018 11:43 AM, Stephen Lewis wrote:
> I'm wondering what information you may be able to provide on performance
> implications of implicit routing VS composite ID routing. In particular,
> I'm curious what the horizontal scaling behavior may be of implicit routing
> or composite ID routing with and without the "/" param appended on.

The hash calculations should probably introduce so little overhead that
you'd never notice it.

I once implemented a hash algorithm using two hash classes built into
Java.  I'm pretty sure that it was NOT a fast implementation ... and it
could calculate over a million hashes per second on input strings of
about 20 characters.

The hash algorithm used by CompositeId (one of the MurmurHash
implementations) is supposed to be one of the fastest algorithms in the
world.  Unless your uniqueId field values are extremely huge, I really
doubt that hash calculation is a significant source of overhead.
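That claim is easy to sanity-check. Here is a rough micro-benchmark using Python's stdlib CRC32 as a stand-in hash (not MurmurHash, and not Java, so the absolute numbers are only indicative of the order of magnitude):

```python
import time
import zlib

def hashes_per_second(n: int = 1_000_000) -> float:
    """Hash n roughly-20-character strings and report the throughput."""
    keys = [f"tenant{i % 1000}!doc-{i:08d}"[:20] for i in range(n)]
    start = time.perf_counter()
    for k in keys:
        zlib.crc32(k.encode("utf-8"))
    elapsed = time.perf_counter() - start
    return n / elapsed

rate = hashes_per_second()
print(f"{rate:,.0f} hashes/sec")
```

Even an interpreted loop like this comfortably clears millions of hashes per second on modern hardware, which supports the point that per-document hash cost is negligible next to indexing and query work.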

The use of implicit doesn't automatically mean there's no overhead for
routing.  The implicit router can still redirect documents to different
shards, it just does it explicitly, usually with a shard name in a
particular field, rather than by hash calculation.
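The distinction might be sketched like this (hypothetical field and shard names, and a stand-in CRC32 hash in place of MurmurHash3; the point is only that implicit routing reads a shard name directly while compositeId computes a hash and matches it against shard ranges):

```python
import zlib

def implicit_route(doc: dict, router_field: str = "shard_name") -> str:
    """Implicit routing: the document names its target shard directly;
    no hash is computed."""
    return doc[router_field]

def composite_route(doc_id: str, shard_ranges: dict) -> str:
    """CompositeId-style routing: hash the id (stand-in hash here) and
    find the shard whose hash range covers it."""
    h = zlib.crc32(doc_id.encode("utf-8")) & 0xFFFFFFFF
    for name, (lo, hi) in shard_ranges.items():
        if lo <= h <= hi:
            return name
    raise ValueError("hash outside all shard ranges")

ranges = {"shard1": (0, 0x7FFFFFFF), "shard2": (0x80000000, 0xFFFFFFFF)}
assert implicit_route({"id": "1", "shard_name": "shard2"}) == "shard2"
assert composite_route("tenant1!id1", ranges) in ranges
```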

> A relatively simple assessment I've done below leads me to believe the
> following is likely the case: if we have S shards and B as our "/bits"
> param, then resource usage would Big O scale as follows (note: Previously
> I've received the advice that any shard should be capped at a max of 120M
> documents, which is where the cap on docs/shard-key comes from)

Query performance should be about the same for any routing type.  It
does look like when you use compositeId and actually implement shard
keys, you can limit queries to those shards, but a *general* query is
going to hit all shards.

If your query rate is very low (or shards are distributed across a lot
of hardware that has significant spare CPU capacity) performance isn't
going to be dramatically different for a query that hits 2 shards versus
one that hits 6 shards.  If your query rate is high, then more shards
probably will be noticeably slower than fewer shards.

For the maximum docs to allow per shard:  It depends.  For some indexes
and use cases, a million documents per shard might be way too big.  For
others, 500 million per shard might have incredible performance.  There
are no hard rules about this.  It's entirely dependent on what you're
actually doing.

Thanks,
Shawn