On Wed, Feb 10, 2010 at 8:04 AM, Yonik Seeley
<yo...@lucidimagination.com> wrote:
> On Wed, Feb 10, 2010 at 12:02 AM, Jon Gifford <jon.giff...@gmail.com> wrote:
>> Alternatively, I could create a collection per customer, which removes
>> the need for slices, but means duplicating the schema many times.
>
> Multiple collections should be able to share a single config (schema
> and related config files).

OK, this solves the top-level problem (how do I manage a single
customer's index while guaranteeing that all indices have the same
schema), which is good.

> Note: I've backed off of the use of "slice" in the public APIs since
> it was contentious (although I still think it's a useful concept and
> it does remain in some of the code).  "shard" is kind of ambiguous,
> but people are pretty good at dealing with ambiguity (and removing
> that ambiguity by introducing another term seemed to add more
> perceived complexity).

I agree that it's a very useful concept, and wonder how much of the
contention is just a terminology issue. If we used "subcollection"
instead, the intent becomes clearer for some use cases. If we used
"tag" or "taggroup", then a slightly different (more powerful?) intent
is suggested.

>> The second part of what I need is to be able to search a single
>> customer's index, which I'm assuming will be a slice. Something like:
>>
>>    http://localhost:8983/solr/collection1/select?distrib=true&slice=customer_1
>
> The URLs on the SolrCloud page have been updated - this would now be
> http://localhost:8983/solr/collection1/select?distrib=true&shards=customer_1
>
> This will work as long as no customer becomes bigger than a shard.  If
> that's not the case, you could query the entire collection and filter
> on customer_1, or create a collection per customer (or do both, if you
> have many small customers that you want to pack in a single shard).

Right. I'd most likely default to using a collection per customer
(assuming that collections can share a single config) because a single
customer's index will be larger than a single shard.

>
> http://localhost:8983/solr/collection1/select?distrib=true&collection=customer_1
>
>> Reading over some of the previous discussions, slices seem to be
>> somewhat contentious, and I wanted to chime in on them a bit here. It
>> seems to me that slices are loosely defined, and I think that's a good
>> thing. If you think of slices as being similar to tags, then it's easy
>> to imagine that any given shard can belong to many different slices.
>
> I wouldn't call it a "slice" but I've also been thinking about how to
> select groups of nodes.
> Extending that to shards would also make sense.

I think the important points here are that if there is the concept of
a group (or slice or subcollection or tag - whatever terminology we
end up using), then

1) The client (typically some front-end code) can use a simpler
interface, which I think is a good thing. Solr doesn't need to expose
how many shards there really are, or what they're named, and the FE
doesn't have to try to generate a list of shard IDs just to do a
search.

2) Some piece of code has to decide which shards to actually search,
and that piece of code has to know exactly which shards actually exist.
If that decision is made in the client, then it has to be made in
every client (your customer-facing search interface, any and all
background tasks you have running, any ad-hoc searches you do for
analysis or spot checking or...). For the sake of simplicity and
sanity, you don't want to have to replicate that decision making code
across multiple apps or languages.

3) The collection and shard entities are at opposite ends of a fairly
wide divide, and there are cases where you need something
"in-between".

In most cases, a simple collection search will suffice, but in those
cases where you want to limit the search to particular shards, it
makes more sense to me to manage that set of shards within Solr, and
expose only the fact that the "groups" are available.
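
A rough sketch of what I mean, in Python. Everything in it is
hypothetical: the GROUPS table, the shard names, and both function names
are made up for illustration and are not an existing Solr API. The only
point is that the group-to-shards mapping lives in one place, and every
client passes a group name instead of a shard list.

```python
# Hypothetical registry of group name -> shards. Nothing here is a Solr API;
# the shard names are placeholders.
GROUPS = {
    "customer_1": ["shard_a", "shard_b"],
    "customer_2": ["shard_b", "shard_c"],  # a shard may belong to many groups
}


def resolve_shards(group, groups=GROUPS):
    """Return the sorted, de-duplicated shards behind a group name."""
    try:
        return sorted(set(groups[group]))
    except KeyError:
        raise ValueError(f"unknown group: {group!r}") from None


def groups_for_shard(shard, groups=GROUPS):
    """Return every group that a given shard belongs to (tag-like lookup)."""
    return sorted(name for name, shards in groups.items() if shard in shards)
```

With something like this on the server side, the front end, the
background tasks, and any ad-hoc searches would all ask for "customer_1"
and never need to know which shards are behind it.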

Here's another example:

Let's say you're generating hourly shards, to limit the maximum size of
the shard that is taking updates, for performance reasons. Let's also
assume that you want to roll those hourlies up into daily or weekly or
maximum size shards once they become less active, so Solr isn't trying
to search 24 shards to get a single day's worth of results. If the
"group" concept exists, then you can hide all of the mechanics of how
and when that happens from the client, while still allowing it to have
some control over how far back it can search, by exposing "groups"
that limit it to the last day or week or whatever makes sense for your
app.
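
As a sketch only, here is how a time-window group could be resolved to
shards, again in Python. The group names ("last_day", "last_week"), the
shard naming scheme (hourly_YYYYMMDDHH, daily_YYYYMMDD), and the
rolled_up_days set are all assumptions of mine, not anything Solr defines.

```python
from datetime import datetime, timedelta

# Hypothetical group names and how far back each one reaches.
GROUP_WINDOWS = {
    "last_day": timedelta(days=1),
    "last_week": timedelta(days=7),
}


def shards_for_group(group, now, rolled_up_days):
    """List the shards a time-window group covers, newest first.

    rolled_up_days is the set of dates whose hourly shards have already
    been merged into a single daily shard.
    """
    window = GROUP_WINDOWS[group]
    # Start at the top of the current hour and walk backwards hour by hour.
    hour = now.replace(minute=0, second=0, microsecond=0)
    shards = []
    while hour > now - window:
        if hour.date() in rolled_up_days:
            name = hour.strftime("daily_%Y%m%d")
        else:
            name = hour.strftime("hourly_%Y%m%d%H")
        if name not in shards:  # a daily shard stands in for many hours
            shards.append(name)
        hour -= timedelta(hours=1)
    return shards
```

The client only ever says "last_day"; whether that turns out to be 24
hourly shards or a handful of hourlies plus one daily is decided in one
place and can change as shards get rolled up.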

cheers

Jon


>
> -Yonik
> http://www.lucidimagination.com
>
