On 2010-01-16 18:18, Yonik Seeley wrote:
On Fri, Jan 15, 2010 at 7:36 PM, Andrzej Bialecki <a...@getopt.org> wrote:
Hi,
My 0.02 PLN on the subject ...
Terminology
-----------
First the terminology: reading your emails I have a feeling that my head is
about to explode. We have to agree on the vocabulary, otherwise we have no
hope of reaching any consensus.
We need more standardized terminology not only for email, but also for
the exact strings we put in ZooKeeper.
Indeed.
I propose the following vocabulary that has
been in use and is generally understood:
* (global) search index: a complete collection of all indexed documents.
From a conceptual point of view, this is our complete search space.
We're currently using "collection". Notice how you had to add
(global) to clarify what you meant. I fear that a sentence like "what
index are you querying" would need constant clarification.
I avoided the word "collection", because Solr deploys various cores
under "collectionX" names, leading users to assume that core ==
collection. "Global index" is two words but it's unambiguous. I'm fine
with the "collection" if we clarify the definition and avoid using this
term for other stuff.
* index shard: a non-overlapping part of the search index.
When you get down to modeling it, this gets a little squishy, and it's
hard to avoid using two words.
Say the complete collection is covered by ShardX and ShardY.
A way to model this is like so:
/collection
   /ShardX
      /node1 [url=..., version=...]
      /node2 [url=..., version=...]
      /node3 [url=..., version=...]
It becomes clearer that there are logical shards and physical shards.
If shards are updateable, they may have different versions at
different times.
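To make the layout above concrete, here is a minimal sketch (just an
illustration - the connect string, the node name, and the url/version
properties are made up) of a node registering a physical copy of ShardX
in ZooKeeper. The logical path is persistent, the per-node entry is
ephemeral:

  import org.apache.zookeeper.CreateMode;
  import org.apache.zookeeper.KeeperException;
  import org.apache.zookeeper.WatchedEvent;
  import org.apache.zookeeper.Watcher;
  import org.apache.zookeeper.ZooDefs.Ids;
  import org.apache.zookeeper.ZooKeeper;

  public class ShardRegistrar {
    public static void main(String[] args) throws Exception {
      ZooKeeper zk = new ZooKeeper("localhost:2181", 10000, new Watcher() {
        public void process(WatchedEvent event) { /* connection events ignored in this sketch */ }
      });
      // The logical layer is persistent: it survives even if every node goes away.
      for (String path : new String[] {"/collection", "/collection/ShardX"}) {
        try {
          zk.create(path, new byte[0], Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        } catch (KeeperException.NodeExistsException alreadyThere) {}
      }
      // The physical layer is ephemeral: node1's entry vanishes when node1 dies,
      // leaving ShardX as a known-but-unserved part of the global index.
      byte[] props = "url=http://node1:8983/solr,version=42".getBytes("UTF-8");
      zk.create("/collection/ShardX/node1", props, Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    }
  }

An ephemeral child disappearing while the persistent ShardX path stays is
exactly the "all physical copies gone, logical shard remains" case.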
Yes, but they are supposed to be eventually consistent - that's where
the replication comes in.
It may also be that all the physical shards go down,
but the logical "ShardX" remains.
Yes, as a missing piece of the global index that is not currently served
by any node, thus leading to incomplete results.
Even the statement "what shard did that response come from" becomes
ambiguous, since we could be talking about a part of the index (ShardX)
or about the specific physical shard/server (it came from node2).
Agreed - but it could be as simple as qualifying this with "from shardX
on node2".
This would be quite natural if you consider that even the same query,
submitted again, could be answered by a different set of nodes that
manage the same set of shards. E.g. with two nodes {n1, n2}, two shards
{s1, s2}, and a replication factor of 2, the selection of which shard on
which node contributes to the list of results could look like this (time
on the Y axis):
q1 {n1:s1,n2:s2}
q2 {n1:s2,n2:s1}
...
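A toy sketch of that per-query selection (the node and shard names
follow the example above; picking a replica at random is just one
possible policy, shown for illustration):

  import java.util.Arrays;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;
  import java.util.Random;

  // Per-query replica selection: for each logical shard pick one node that
  // hosts a copy, so repeated queries may be served by different node/shard pairs.
  public class ReplicaSelector {
    private final Random rnd = new Random();

    public Map<String, String> selectNodes(Map<String, List<String>> replicas) {
      Map<String, String> plan = new HashMap<String, String>();
      for (Map.Entry<String, List<String>> shard : replicas.entrySet()) {
        List<String> nodes = shard.getValue();
        plan.put(shard.getKey(), nodes.get(rnd.nextInt(nodes.size())));
      }
      return plan;
    }

    public static void main(String[] args) {
      Map<String, List<String>> replicas = new HashMap<String, List<String>>();
      replicas.put("s1", Arrays.asList("n1", "n2"));   // replication factor 2
      replicas.put("s2", Arrays.asList("n1", "n2"));
      ReplicaSelector sel = new ReplicaSelector();
      System.out.println("q1 " + sel.selectNodes(replicas));   // e.g. {s1=n1, s2=n2}
      System.out.println("q2 " + sel.selectNodes(replicas));   // e.g. {s1=n2, s2=n1}
    }
  }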
Together, all shards in the system form the complete search space of the
global search index. E.g. starting from one big index, I could divide it
into multiple shards using MultiPassIndexSplitter, and if I combined all
the shards again, using IndexMergeTool, I should obtain the original
complete search index (modulo changed Lucene docids, which doesn't
matter). I strongly believe in micro-shards, because they are much
easier to handle and replicate. Also, since we control the shards, we
don't have to deal with overlapping shards, which is the curse of P2P
search.
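For the merge direction, a rough sketch of recombining shard directories
into a single index - written against a Lucene 3.x-era API, so take the
class and method names as assumptions; they differ between versions:

  import java.io.File;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.IndexWriterConfig;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;
  import org.apache.lucene.util.Version;

  public class ShardMerger {
    // Usage: ShardMerger <targetDir> <shardDir1> <shardDir2> ...
    public static void main(String[] args) throws Exception {
      Directory target = FSDirectory.open(new File(args[0]));
      IndexWriterConfig cfg =
          new IndexWriterConfig(Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
      IndexWriter writer = new IndexWriter(target, cfg);
      Directory[] shards = new Directory[args.length - 1];
      for (int i = 1; i < args.length; i++) {
        shards[i - 1] = FSDirectory.open(new File(args[i]));
      }
      writer.addIndexes(shards);   // merge all shards back into one index
      writer.close();
    }
  }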
Prohibiting overlapping shards effectively prohibits ever merging or
splitting shards online (it could only be an offline or blocking
operation). Anyway, in the opaque shard model (where clients create
shards, and we don't know how they partitioned them), shards would
have to be non-overlapping.
The opaque model means it's more difficult to support updates. IMHO it
makes sense to start with a set of stricter assumptions in order to
build something workable, and then relax them as we gain experience.
As far as the future goes (allocation and rebalancing), I'm happy with a
small-shard approach that avoids merging and splitting. It carries
some other nice little side benefits as well.
* partitioning: a method whereby we can determine the target shard ID based
on a doc ID.
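For illustration only, the simplest such method could be a hash of the
doc ID modulo the number of shards (an example, not a proposal for the
actual partitioning scheme):

  // Hash-based partitioning: the shard ID is a pure function of the doc ID,
  // so any node can compute where a document belongs.
  public class HashPartitioner {
    private final int numShards;

    public HashPartitioner(int numShards) { this.numShards = numShards; }

    public int shardId(String docId) {
      // mask the sign bit so the result is always in [0, numShards)
      return (docId.hashCode() & 0x7fffffff) % numShards;
    }

    public static void main(String[] args) {
      HashPartitioner p = new HashPartitioner(4);
      System.out.println(p.shardId("doc-12345"));   // same doc ID -> same shard, always
    }
  }

With a fixed function like this, the system always knows the docid ->
shard mapping, which is what makes updates tractable.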
I think we're all using partitioning the same way, but that's a
narrower definition than needed.
A user may partition the index, and Solr may not have the mapping of
docid to shard.
See above - of course this would be cool and extra convenient for users,
but much more difficult to implement in a way that supports updates.
You've also used some slightly new terminology... "shard ID" as
opposed to just shard, which reinforces the need for different
terminology for the physical vs the logical.
You got me ;) Yes, when I say "shard" I mean the logical entity, as
defined by a set of documents - a physical shard I would call a replica.
Now, to translate this into Solr-speak: depending on the details of the
design, and the evolution of Solr, one search node could be one Solr
instance that manages one shard per core.
A Solr core is a bit too heavyweight for a micro-shard, though.
I think a single Solr core really needs to be able to handle multiple
shards for this to become practical.
Ok. This is actually related to the issue below (witness SOLR-1366).
Let's forget here about the
current distributed search component, and the current replication
Heh. I think this is what is causing some of the mismatches...
different starting points and different assumptions.
They work well with the current assumptions, and are known to work
poorly with the design that we are discussing.
- they
could be useful in this design as a raw transport mechanism, but someone
else would be calling the shots (see below).
Seems like we need to be flexible in allowing customers to call the
shots to varying degrees.
Eventually, yes - but initially I fear we won't be able to come up with
a model that allows this much flexibility and is still implementable in
a reasonable time ...
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com