On Fri, Jan 15, 2010 at 7:36 PM, Andrzej Bialecki <a...@getopt.org> wrote: > Hi, > > My 0.02 PLN on the subject ... > > Terminology > ----------- > First the terminology: reading your emails I have a feeling that my head is > about to explode. We have to agree on the vocabulary, otherwise we have no > hope of reaching any consensus.
We not only need more standardized terminology for email, but for exact strings to put in zookeeper. > I propose the following vocabulary that has > been in use and is generally understood: > > * (global) search index: a complete collection of all indexed documents. > From a conceptual point of view, this is our complete search space. We're currently using "collection". Notice how you had to add (global) to clarify what you meant. I fear that a sentence like "what index are you querying" would need constant clarification. > * index shard: a non-overlapping part of the search index. When you get down to modeling it, this gets a little squishy and is hard to avoid using two words. Say the complete collection is covered by ShardX and ShardY. A way to model this is like so: /collection /ShardX /node1 [url=..., version=...] /node2 [url=..., version=...] /node3 [url=..., version=...] It becomes clearer that there are logical shards and physical shards. If shards are updateable, they may have different versions at different times. It may also be that all the physical shards go down, but the logical "ShardX" remains. Even the statement "what shard did that response come from" becomes ambiguous since we could be talking a part of the index (ShardX) or we could be talking about the specific physical shard/server (it came from node2). > All shards in the > system form together the complete search space of the search index. E.g. > having initially one big index I could divide it into multiple shards using > MultiPassIndexSplitter, and if I combined all the shards again, using > IndexMerger, I should obtain the original complete search index (modulo > changed Lucene docids .. doesn't matter). I strongly believe in > micro-sharding, because they are much easier to handle and replicate. Also, > since we control the shards we don't have to deal with overlapping shards, > which is the curse of P2P search. Prohibiting overlapping shards effectively prohibits ever merging or splitting shards online (it could only be an offline or blocking operation). Anyway, in the opaque shard model (where clients create shards, and we don't know how they partitioned them), shards would have to be non-overlapping. As far as the future (allocation and rebalancing), I'm happy with a small-shard approach that avoids merging and splitting. It carries some other nice little side benefits as well. > * partitioning: a method whereby we can determine the target shard ID based > on a doc ID. I think we're all using partitioning the same way, but that's a narrower definition than needed. A user may partition the index, and Solr may not have the mapping of docid to shard. You've also used some slightly new terminology... "shard ID" as opposed to just shard, which reinforces the need for different terminology for the physical vs the logical. > * search node: an application that provides search and update to one or more > shards. > > * search host: a machine that may run 1 or more search nodes. > > * Shard Manager: a component that keeps track of allocation of shards to > nodes (plus more, see below). > > Now, to translate this into Solr-speak: depending on the details of the > design, and the evolution of Solr, one search node could be one Solr > instance that manages one shard per core. A solr core is a bit too heavyweight for a microshard though. I think a single solr core really needs to be able to handle multiple shards for this to become practical. > Let's forget here about the > current distributed search component, and the current replication Heh. I think this is what is causing some of the mismatches... different starting points and different assumptions. > - they > could be useful in this design as a raw transport mechanism, but someone > else would be calling the shots (see below). Seems like we need to be flexible in allowing customers to call the shots to varying degrees. -Yonik http://www.lucidimagination.com > Architecture > ------------ > The replication and load balancing is a problem with many existing > solutions, and this one in particular reminds me strongly of the Hadoop > HDFS. In fact, early on during the development of Hadoop [1] I wondered > whether we could reuse HDFS to manage Lucene indexes instead of opaque > blocks of fixed size. It turned out to be infeasible, but the model of > Namenode/Datanode still looks useful in our case, too. > > I believe there are many useful lessons lurking in Hadoop/HBase/Zookeeper > that we could reuse in our design. The following is just a straightforward > port of the Namenode/Datanode concept. > > Let's imagine a component called ShardManager that is responsible for > managing the following data: > > * list of shard ID-s that together form the complete search index, > * for each shard ID, list of search nodes that serve this shard. > * issuing replication requests > * maintaining the partitioning function (see below), so that updates are > directed to correct shards > * maintaining heartbeat to check for dead nodes > * providing search clients with a list of nodes to query in order to obtain > all results from the search index. > > Whenever a new search node comes up, it reports its local shard ID-s > (versioned) to the ShardManager. Based on these reports from the currently > active nodes, the ShardManager builds this mapping of shards to nodes, and > requests replication if some shards are too old, or if the replication count > is too low, allocating these shards to selected nodes (based on a policy of > some kind). > > I believe most of the above functionality could be facilitated by Zookeeper, > including the election of the node that runs the ShardManager. > > Updates > ------- > We need a partitioning schema that splits documents more or less evenly > among shards, and at the same time allows us to split or merge unbalanced > shards. The simplest function that we could imagine is the following: > > hash(docId) % numShards > > though this has the disadvantage that any larger update will affect multiple > shards, thus creating an avalanche of replication requests ... so a > sequential model would be probably better, where ranges of docIds are > assigned to shards. > > Now, if any particular shard is too unbalanced, e.g. too large, it could be > further split in two halves, and the ShardManager would have to record this > exception. This is a very similar process to a region split in HBase, or a > page split in btree DBs. Conversely, shards that are too small could be > joined. This is the icing on the cake, so we can leave it for later. > > After commit, a node contacts the ShardManager to report a new version of > the shard. ShardManager issues replication requests to other nodes that hold > a replica of this shard. > > Search > ------ > There should be a component sometimes referred to as query integrator (or > search front-end) that is the entry and exit point for user search requests. > On receiving a search request this component gets a list of randomly > selected nodes from SearchManager to contact (the list containing all shards > that form the global index), sends the query and integrates partial results > (under a configurable policy for timeouts/early termination), and sends back > the assembled results to the user. > > Again, somewhere in the background the knowledge of who to contact should be > handled by Zookeeper. > > That's it for now from the top of my head ... > > ----------- > > [1] > http://www.mail-archive.com/nutch-develop...@lists.sourceforge.net/msg02273.html > > -- > Best regards, > Andrzej Bialecki <>< > ___. ___ ___ ___ _ _ __________________________________ > [__ || __|__/|__||\/| Information Retrieval, Semantic Web > ___|||__|| \| || | Embedded Unix, System Integration > http://www.sigram.com Contact: info at sigram dot com