On Wed, Feb 10, 2010 at 8:04 AM, Yonik Seeley <yo...@lucidimagination.com> wrote:
> On Wed, Feb 10, 2010 at 12:02 AM, Jon Gifford <jon.giff...@gmail.com> wrote:
>> Alternatively, I could create a collection per customer, which removes
>> the need for slices, but means duplicating the schema many times.
>
> Multiple collections should be able to share a single config (schema
> and related config files).

OK, this solves the top-level problem (how do I manage a single customer's index while guaranteeing that all indices have the same schema), which is good.

> Note: I've backed off of the use of "slice" in the public APIs since
> it was contentious (although I still think it's a useful concept and
> it does remain in some of the code). "shard" is kind of ambiguous,
> but people are pretty good at dealing with ambiguity (and removing
> that ambiguity by introducing another term seemed to add more
> perceived complexity).

I agree that it's a very useful concept, and I wonder how much of the contention is just a terminology issue. If we used "subcollection" instead, the intent becomes clearer for some use cases. If we used "tag" or "taggroup", then a slightly different (more powerful?) intent is suggested.

>> The second part of what I need is to be able to search a single
>> customer's index, which I'm assuming will be a slice. Something like:
>>
>> http://localhost:8983/solr/collection1/select?distrib=true&slice=customer_1
>
> The URLs on the SolrCloud page have been updated - this would now be
> http://localhost:8983/solr/collection1/select?distrib=true&shards=customer_1
>
> This will work as long as no customer becomes bigger than a shard. If
> that's not the case, you could query the entire collection and filter
> on customer_1, or create a collection per customer (or do both, if you
> have many small customers that you want to pack in a single shard).

Right. I'd most likely default to using a collection per customer (assuming that collections can share a single config), because a single customer's index will be larger than a single shard.

>
> http://localhost:8983/solr/collection1/select?distrib=true&collection=customer_1
>
>> Reading over some of the previous discussions, slices seem to be
>> somewhat contentious, and I wanted to chime in on them a bit here. It
>> seems to me that slices are loosely defined, and I think that's a good
>> thing.
>> If you think of slices as being similar to tags, then it's easy
>> to imagine that any given shard can belong to many different slices.
>
> I wouldn't call it a "slice" but I've also been thinking about how to
> select groups of nodes.
> Extending that to shards would also make sense.

I think the important points here are that if there is the concept of a group (or slice or subcollection or tag - whatever terminology we end up using), then:

1) The client (typically some front-end code) can use a simpler interface, which I think is a good thing. Solr doesn't need to expose how many shards there really are, or what they're named, and the front end doesn't have to generate a list of shard IDs just to do a search.

2) Some piece of code has to decide which shards to actually search, and that piece of code has to know exactly which shards exist. If that decision is made in the client, then it has to be made in every client (your customer-facing search interface, any and all background tasks you have running, any ad-hoc searches you do for analysis or spot checking or...). For the sake of simplicity and sanity, you don't want to have to replicate that decision-making code across multiple apps or languages.

3) The collection and shard entities are at opposite ends of a fairly wide divide, and there are cases where you need something "in-between". In most cases, a simple collection search will suffice, but in those cases where you want to limit the search to particular shards, it makes more sense to me to manage that set of shards within Solr, and expose only the fact that the "groups" are available.

Here's another example: Let's say you're generating hourly shards, to limit the maximum size of the shard that is taking updates, for performance reasons. Let's also assume that you want to roll those hourlies up into daily or weekly or maximum-size shards once they become less active, so Solr isn't trying to search 24 shards to get a single day's worth of results.
If the "group" concept exists, then you can hide all of the mechanics of how and when that happens from the client, while still allowing it to have some control over how far back it can search, by exposing "groups" that limit it to the last day or week or whatever makes sense for your app.

cheers
Jon

>
> -Yonik
> http://www.lucidimagination.com
>
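
To make the hourly/daily roll-up example above a bit more concrete, here is a rough Python sketch of what resolving a "group" to a list of shards might look like if that decision lived in one place rather than in every client. Everything in it is an assumption for illustration: the shard naming scheme (hourly_YYYYMMDDHH, daily_YYYYMMDD), the "last_day" group name, and the helper functions are all hypothetical and not part of Solr. Only the distrib and shards request parameters come from the URLs quoted earlier in this thread.

```python
from datetime import datetime, timedelta
from urllib.parse import urlencode


def shards_for_last_day(now, rolled_up):
    """Return shard names covering the last 24 hours, newest first.

    `rolled_up` is a set of dates whose hourly shards have already been
    merged into one daily shard. The naming scheme is an assumption.
    """
    top_of_hour = now.replace(minute=0, second=0, microsecond=0)
    shards = []
    for h in range(24):
        t = top_of_hour - timedelta(hours=h)
        if t.date() in rolled_up:
            name = t.strftime("daily_%Y%m%d")
        else:
            name = t.strftime("hourly_%Y%m%d%H")
        if name not in shards:  # a daily shard covers many hours; list it once
            shards.append(name)
    return shards


# Hypothetical registry: the client only ever sees the group names.
GROUPS = {"last_day": shards_for_last_day}


def select_url(base, query, group, now, rolled_up=frozenset()):
    """Build a select URL for `group`, hiding the shard list from the caller."""
    shards = GROUPS[group](now, rolled_up)
    params = urlencode({"q": query, "distrib": "true", "shards": ",".join(shards)})
    return f"{base}/select?{params}"
```

With this, a client asks for "last_day" and never needs to know whether yesterday's hourlies have been rolled up yet; the same call yields 24 shard names before the roll-up and fewer afterwards.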