On Sat, Dec 3, 2011 at 10:54 AM, Shawn Heisey <s...@elyograg.org> wrote:

> In another thread, something was said that sparked my interest:
>
> On 12/1/2011 7:17 PM, Ted Dunning wrote:
>
>> Of course, resharding is almost never necessary if you use micro-shards.
>>  Micro-shards are shards small enough that you can fit 20 or more on a
>> node.  If you have that many on each node, then adding a new node consists
>> of moving some shards to the new machine rather than moving lots of little
>> documents.
>>
>> Much faster.  As in thousands of times faster.
>>
>
> ...
>
> Currently I split my data into shards in two ways.  The most recent data
> (between 3.5 to 7 days, trying to keep it below 500,000 records) goes into
> one shard.  The rest of the data is split using the formula crc32(did) %
> numShards.  The value of numShards is currently six.  Each of those large
> shards has nearly 11 million documents in 20GB of disk space.
>

OK.  That is a relatively common arrangement.
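
For what it's worth, that routing rule is just a stable hash over the document
id.  A minimal sketch of the same idea (illustrative only; it assumes the id is
a string and numShards = 6 as in your setup):

import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Sketch of the routing rule quoted above: shard = crc32(did) % numShards.
// Illustrative only: "did" is assumed to be the document id as a string,
// and numShards would be 6 in the setup described.
public class ShardRouter {
    private final int numShards;

    public ShardRouter(int numShards) {
        this.numShards = numShards;
    }

    public int shardFor(String did) {
        CRC32 crc = new CRC32();
        crc.update(did.getBytes(StandardCharsets.UTF_8));
        // CRC32.getValue() returns a non-negative long, so the modulus is safe.
        return (int) (crc.getValue() % numShards);
    }
}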

> I am already using the concept of micro-sharding, but certainly not on a
> grand scale.  One copy of the index is served by two hosts with 8 CPU
> cores, so each host has three of the large shards.  Doing some least common
> multiple calculations, I have determined that 420 shards would allow me to
> use the shard-moving method to add one host at a time until I am up to 7
> hosts.  To reach 8, I would need 840 shards, and to make it to 9 or 10, I
> would need 2520 shards.


Not really.  You only get this explosion in shard counts if you require an
exactly even split.  Once the number of shards is larger than the number of
cores on your machine, you really don't need to balance exactly evenly.

For instance, 8 shards on each of three machines (24 total) obviously splits
evenly after adding one node to get four machines (6 each), but moving to 5
machines isn't so bad either: four machines get 5 shards and the fifth gets 4.
At six nodes, the split is even again.  At seven nodes, you have three nodes
with 4 shards and four nodes with 3, which is beginning to be a bit
unbalanced.  At eight, the balance is even again.

Thus, you can get away with >2x scaling even if you start with a very
modest number of shards on a small number of machines.  Remember also that
scaling down is important, and going from 3 to 2 nodes works just fine.
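
To make the arithmetic concrete, here is a throwaway sketch that spreads 24
shards over 3 through 8 nodes; the per-node counts never differ by more than
one:

// Distribute a fixed pool of micro-shards over a varying number of nodes and
// print the resulting per-node counts.  With 24 shards, no layout from 3 to 8
// nodes is ever more than one shard out of balance.
public class ShardBalance {
    public static void main(String[] args) {
        int totalShards = 24;
        for (int nodes = 3; nodes <= 8; nodes++) {
            int base = totalShards / nodes;    // every node gets at least this many
            int extra = totalShards % nodes;   // this many nodes carry one more
            System.out.printf("%d nodes: %d node(s) with %d shards, %d node(s) with %d shards%n",
                    nodes, extra, base + 1, nodes - extra, base);
        }
    }
}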


...
> I am curious as to the amount of overhead that large numbers of shards
> would introduce.  I already know from experience that when an index is
> optimized from 20-30 largish segments (initial full index) to one, it
> shrinks a little bit.  I assume that there would be similar overhead
> involved in having a lot of shards.  Does anyone have any way to know how
> much overhead that would be?
>

The overhead in size is very modest and isn't really the problem.  The
issue is that a noticeable amount of per-query work is repeated on every
shard.  This means that if you split an index into n shards, the query on
each shard does not take 1/n as much time as the same query against the
original index.

Most sites, however, have gaps between queries.  This means that
multi-threading the query by sharding up to the number of hyper-threads on
the machine actually improves response time, even if not quite linearly.
Thus, having 8, 16, or 24 shards on a single node (depending on processor
and socket count) may be a great idea.
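
As a sketch of what I mean by multi-threading a query across the local
shards (the searchShard() method and Result type here are stand-ins, not any
particular API):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

// Per-shard threading within a single node: one task per local shard keeps
// all cores busy for a single query, then the per-shard top-N lists are
// merged into a node-level top-N.
public class LocalFanout {
    private final ExecutorService pool =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

    public List<Result> search(List<Integer> localShards, String query, int topN)
            throws InterruptedException, ExecutionException {
        List<Future<List<Result>>> futures = new ArrayList<>();
        for (int shard : localShards) {
            futures.add(pool.submit(() -> searchShard(shard, query, topN)));
        }
        List<Result> merged = new ArrayList<>();
        for (Future<List<Result>> f : futures) {
            merged.addAll(f.get());          // collect each shard's top-N list
        }
        merged.sort((a, b) -> Float.compare(b.score, a.score));
        return merged.subList(0, Math.min(topN, merged.size()));   // node-level top-N
    }

    // Placeholder type and method, purely illustrative.
    static class Result { float score; }
    private List<Result> searchShard(int shardId, String query, int topN) {
        throw new UnsupportedOperationException("backend-specific");
    }
}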


> Our search results grids are currently 70 items.  If someone were to go
> through the results to page 21, they would be asking for a start value of
> 1400.  With 420 shards, the distributed search would have to deal with
> 588000 items.  That's a lot of results to deal with.  The overhead is much
> smaller with 60 shards, but I've seen searches that indicate some dedicated
> individuals will delve a lot deeper than 20 pages.  How much extra memory
> does it take when a distributed search has to deal with a million or more
> results?


Well, as stated above, having such a large number of shards is far from
what I am suggesting.  You are correct that the receiving node will need to
collate a large number of results, but the method I have used in Katta
sends the query to each node, and each node then runs the query in one
thread per shard.  Results are collated per node and returned to the query
source for final collation.  Thus, in an extreme case you would need to
handle 24 x 2000 results = 48,000 items on each node, plus 2000 results from
each node in the cluster at the final collation step.  For a large cluster
with, say, 100 nodes, that would be 200,000 results.  In the systems I have
designed with half that many nodes, the returned results are simply ids +
display snippets.  Thus, total memory for this relatively extreme query
would be < 200 MB, which would take several hundred milliseconds to receive
over 10GbE networking and about 2 seconds over 1GbE networking.  This is
equivalent to looking at the 100th page of normal search results, so I
wouldn't worry too much about such an extreme corner case.
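
If you want to sanity-check those numbers yourself, the arithmetic is
straightforward (the ~1 KB per id-plus-snippet figure below is my assumption,
not a measurement):

// Back-of-the-envelope version of the deep-paging estimate above.
public class DeepPagingEstimate {
    public static void main(String[] args) {
        int shardsPerNode = 24, nodes = 100, resultsPerShard = 2000;
        long perNodeItems = (long) shardsPerNode * resultsPerShard;  // 48,000 collated on each node
        long finalItems = (long) nodes * resultsPerShard;            // 200,000 at the query source
        long bytesPerItem = 1024;                                    // assumed id + snippet size
        double totalMB = finalItems * bytesPerItem / 1e6;            // roughly 200 MB
        double secondsAt1GbE = totalMB * 8.0 / 1000.0;               // 1 Gbit/s is about 125 MB/s
        double secondsAt10GbE = secondsAt1GbE / 10.0;
        System.out.printf("per-node items: %d, final collation items: %d%n",
                perNodeItems, finalItems);
        System.out.printf("final payload: ~%.0f MB, ~%.1f s at 1GbE, ~%.2f s at 10GbE%n",
                totalMB, secondsAt1GbE, secondsAt10GbE);
    }
}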



> I've got an 8GB heap for Solr, which has been more than enough for
> everything but a distributed termsComponent request on my largest field.  I
> don't attempt those any more, it always requires a Solr restart before
> normal queries will resume.
>

Things have changed since I last did this sort of thing seriously.  My
guess is that this is a relatively small amount of memory to devote to
search.  It used to be that the only way to do this effectively with
Lucene-based systems was to keep the heap relatively small, as you have
here, and put the index into a tmpfs mount.  I think better ways are now
available that keep the index in memory in the search engine itself for
better speed.

One of our current customers has search engines with 128GB of memory.  He
fills much of that with a live index sharded about 10-fold.  In-memory
indexes can run enough faster to be more cost-effective than disk-based
indexes, because you need far fewer machines to serve the searches within
the required response time.

> I already have a way to deal with resharding, because I can rebuild one
> copy of my index with an independent new configuration while the other
> stays completely online.  It takes a few hours, of course.


This is a very modestly sized index.  With micro-sharding, Hadoop can be used
very effectively for indexing to get substantially faster indexing times.


> There's overhead with micro-sharding.  The index would get larger, and the
> inherent problems with deep paging in distributed search will be amplified
> by a large increase in shard count.  Are the potential benefits worth
> incurring that overhead?
>

Yes, there is overhead, but not nearly of the magnitude you have outlined.
My guess is that it is less than a 2x cost in query time relative to perfect
multi-thread speedup, and less than 20% in index size.  These figures are
based on experience, not on direct measurement.  They mean that you might get
an 8x speedup instead of a predicted 16x speedup, because each shard's query
costs up to twice its ideal share of the work.
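
Spelling out the arithmetic behind that estimate (the factors are rough,
experience-based numbers, not measurements):

// Rough model of the overhead estimate above: if each per-shard query costs up
// to 2x its ideal 1/n slice of the work, an n-way split yields about n/2
// speedup, and the index grows by up to roughly 20%.
public class OverheadEstimate {
    public static void main(String[] args) {
        int shards = 16;
        double queryCostFactor = 2.0;   // relative to perfect multi-thread speedup
        double sizeOverhead = 0.20;     // relative growth in index size
        System.out.printf("expected speedup: ~%.0fx instead of %dx%n",
                shards / queryCostFactor, shards);
        System.out.printf("expected index size: ~%.0f%% larger%n", sizeOverhead * 100);
    }
}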

In my experience, I was able to drop search times dramatically because I
had substantial multi-threading opportunities.  I know that most high-end
search engines do something very similar, with shard counts of 10-30 per
node.

Here are some good, but dated, references on the trade-offs of using in-memory
indices.  These papers imply at least mini-sharding, if not full-on
micro-sharding.

http://ciir.cs.umass.edu/~strohman/slides/sigir2007-memory.pdf

http://ciir-publications.cs.umass.edu/pub/web/getpdf.php?id=715
