On Wed, Feb 10, 2010 at 10:47 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> This is analogous to the multiple data sources that we have at
> Deepdyve<http://www.deepdyve.com>.
> In a fully sharded and balanced environment, I have found it *much* more
> efficient to put all data sources into a single collection and use a filter
> to select one or the other.

I suspect I'll be forced into the position of combining some subsets
of customers if (when?) multi-core performance becomes an issue, but
even if that's true, there are still likely to be customers that
(because of input volume) require their own collection. So I think the
capability to do what I'm suggesting is still key to me. To give you
an idea of what I mean by "input volume", we're talking to potential
customers who have streams of 15-20k "documents" per second, coming in
24/7, and the hardware we're going to be using can handle something
like 4-5k/second per core. So for a customer like that, we'll create a
collection and split the input 4 or 5 or 6 ways to make sure we're
keeping up to date.
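
The arithmetic behind that split can be sketched roughly as below. This is
only a back-of-the-envelope illustration: shards_needed and headroom_pct are
made-up names, and the 20% headroom is an assumed safety margin, not a
measured figure.

```python
def shards_needed(input_rate, per_core_rate, headroom_pct=0):
    """Smallest number of cores whose combined indexing rate covers
    input_rate plus an optional safety margin (all rates in docs/second)."""
    required = input_rate * (100 + headroom_pct)
    capacity = per_core_rate * 100
    # Integer ceiling division, so no floating-point rounding surprises.
    return -(-required // capacity)


# Rates quoted above: 15-20k docs/sec arriving, 4-5k docs/sec per core.
print(shards_needed(15_000, 5_000), shards_needed(20_000, 4_000))
# Same rates with an assumed 20% headroom.
print(shards_needed(15_000, 5_000, 20), shards_needed(20_000, 4_000, 20))
```

With no margin the bare minimum is 3 to 5 cores; the assumed 20% margin
moves that to 4 to 6.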


> The rationale is that data sources are
> distributed in size according to a rough long-tail distribution.  For the
> largest ones, the filters are about as efficient as a separate index because
> they are such a large fraction of the index.  For the small ones, the
> filtered query is so fast that other issues form the bottleneck anyway.  The
> operational economies of not managing hundreds of indexes and the much
> better load balancing makes the integrated solution massively better for
> me.

Yeah, I'm a little concerned, to be honest, about the idea of having
hundreds or thousands of collections floating around. No matter how
you look at it, that's some serious overhead. Combining the "long-tail"
customers (i.e. low data rate) into one or more combo-indices makes a
fair amount of sense, and I've been pondering that question for a
while now.
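
For what it's worth, a combo-index with a per-customer filter might look
something like the sketch below. Solr's fq (filter query) parameter is real;
the customer_id field name, the host and the combo-index name are all
hypothetical, and the customer id is assumed to come from our own
authentication layer rather than from anything the user types.

```python
from urllib.parse import urlencode


def build_select_url(base_url, customer_id, user_query):
    """Build a Solr /select URL that restricts results to one customer.

    `customer_id` is a hypothetical integer field on every document in
    the shared index; the filter is always added server-side.
    """
    params = [
        ("q", user_query),
        # int() rejects anything that is not a plain number, so the
        # filter cannot be widened with extra query syntax.
        ("fq", f"customer_id:{int(customer_id)}"),
    ]
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical host and index name, for illustration only.
print(build_select_url("http://localhost:8983/solr/combo1", 42, "error timeout"))
```

Since the filter is the only thing separating customers in a shared index,
it would have to be applied on every request without exception.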

> We currently use Katta and this system works really, really well.

Katta does look nice, but the SolrCloud stuff seems simpler and closer
to what I need. We shall see :-)

> One big difference in our environments is that for me, the dominant query
> pattern involves most data sources while for you, the dominant pattern will
> likely involve a single data source.

Yeah. In fact, we have a hard requirement that customer X can't see
customer Y's data. Other differences are that (1) most customers are
interested in the newest data, so we're likely to have a very hot core
with the most recent data, and a bunch of cooler ones with historical
data that is accessed infrequently, and (2) we don't expect search
load to be either high or continuous: we're primarily serving as a
forensics tool, so queries will be very bursty for each customer.

>
> On Tue, Feb 9, 2010 at 9:02 PM, Jon Gifford <jon.giff...@gmail.com> wrote:
>
>> 1) Support one index per customer, and many customers (thus, many
>> independent indices)
>>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>
