On Wed, Feb 10, 2010 at 10:47 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> This is analogous to the multiple data sources that we have at
> Deepdyve<http://www.deepdyve.com>.
> In a fully sharded and balanced environment, I have found it *much* more
> efficient to put all data sources into a single collection and use a filter
> to select one or the other.

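For concreteness, the single-collection-plus-filter approach quoted above could look roughly like the sketch below. It only builds the query string. `q` and `fq` are standard Solr request parameters, but the `source_id` field name and the helper itself are hypothetical, made up for illustration rather than taken from either setup.

```python
from urllib.parse import urlencode


def build_filtered_query(user_query: str, source_id: str) -> str:
    """Build a Solr-style query string restricted to one data source.

    `source_id` is a hypothetical field name used only for illustration.
    """
    # Accept only simple alphanumeric ids so that query syntax embedded
    # in the id cannot change the meaning of the filter clause.
    if not source_id.isalnum():
        raise ValueError("source_id must be alphanumeric")
    params = {
        "q": user_query,                 # the user's search terms
        "fq": f"source_id:{source_id}",  # filter: one data source only
    }
    return urlencode(params)


print(build_filtered_query("disk failure", "acme42"))
# prints: q=disk+failure&fq=source_id%3Aacme42
```

The point of keeping the restriction in `fq` rather than in `q` is that the filter is applied separately from the user's query terms, so every data source shares one index and only the filter changes per request.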
I suspect I'll be forced into the position of combining some subsets of
customers if (when?) multi-core performance becomes an issue, but even if
that's true, there are still likely to be customers that (because of input
volume) require their own collection. So I think the capability to do what
I'm suggesting is still key to me.

To give you an idea of what I mean by "input volume", we're talking to
potential customers who have streams of 15-20k "documents" per second,
coming in 24/7, and the hardware we're going to be using can handle
something like 4-5k/second per core. So for a customer like that, we'll
create a collection and split the input 4 or 5 or 6 ways to make sure we're
keeping up to date.

> The rationale is that data sources are
> distributed in size according to a rough long-tail distribution. For the
> largest ones, the filters are about as efficient as a separate index because
> they are such a large fraction of the index. For the small ones, the
> filtered query is so fast that other issues form the bottleneck anyway. The
> operational economies of not managing hundreds of indexes and the much
> better load balancing makes the integrated solution massively better for
> me.

Yeah, I'm a little concerned, to be honest, about the idea of having
hundreds or thousands of collections floating around. No matter how you look
at it, that's some serious overhead. Combining the "long-tail" customers
(i.e. low data rate) into one or more combo-indices makes a fair amount of
sense, and I've been pondering that question for a while now.

> We currently use Katta and this system works really, really well.

Katta does look nice, but the SolrCloud stuff seems simpler and closer to
what I need. We shall see :-)

> One big difference in our environments is that for me, the dominant query
> pattern involves most data sources while for you, the dominant pattern will
> likely involve a single data source.

Yeah.
In fact, we have a hard requirement that customer X can't see customer Y's
data. Other differences are that (1) most customers are interested in the
newest data, so we're likely to have a very hot core with the most recent
data, and a bunch of cooler ones with historical data that is accessed
infrequently, and (2) we don't expect that search load will be either high
or continuous, since we're primarily serving as a forensics tool, so it will
be very bursty for each customer.

>
> On Tue, Feb 9, 2010 at 9:02 PM, Jon Gifford <jon.giff...@gmail.com> wrote:
>
>> 1) Support one index per customer, and many customers (thus, many
>> independent indices)
>>
>
> --
> Ted Dunning, CTO
> DeepDyve
>
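
As a rough check on the "input volume" figures above (15-20k documents per second incoming, 4-5k per second per core), here is a small back-of-the-envelope sketch of how many ways the input would need to be split. The function name and the `spare` parameter are my own illustration; the idea of adding one spare core for headroom is an assumption, not something measured.

```python
def cores_needed(docs_per_sec: int, per_core_docs_per_sec: int, spare: int = 0) -> int:
    """Minimum number of cores needed to keep up with an input stream,
    plus an optional number of spare cores for headroom (an assumption)."""
    if docs_per_sec < 0 or per_core_docs_per_sec <= 0 or spare < 0:
        raise ValueError("rates must be positive and spare non-negative")
    # Integer ceiling division: round up so a partial core counts as a whole one.
    minimum = -(-docs_per_sec // per_core_docs_per_sec)
    return minimum + spare


# Try both ends of the quoted ranges.
for rate in (15_000, 20_000):
    for per_core in (4_000, 5_000):
        print(rate, per_core,
              cores_needed(rate, per_core),
              cores_needed(rate, per_core, spare=1))
```

With these figures the bare minimum comes out at 3 to 5 cores, and allowing one spare core gives 4 to 6, which is in line with splitting the input 4 or 5 or 6 ways.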