Hmm... This doesn't sound like the hashing algorithm that's on the branch, right? The algorithm you're describing sounds like it has some logic that can tell that a particular range should be distributed between 2 shards instead of 1. So it seems like a trade-off between repartitioning the entire index (on every shard) and having a custom hashing algorithm that can handle the situation where 2 or more shards map to a particular range.

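To make the range idea concrete, here is a minimal sketch of range-based routing in which one shard's hash range can be split in half and the upper half handed to a new shard. Everything in it is an assumption for illustration: the class `HashRangeRouter`, its methods, and the use of `String.hashCode()` as the hash are hypothetical and are not the code on the solrcloud branch.

```java
import java.util.TreeMap;

/** Hypothetical sketch: routes document ids to shards by hash range. */
final class HashRangeRouter {
    // Inclusive lower bound of each hash range -> shard that owns the range.
    private final TreeMap<Integer, String> shardByRangeStart = new TreeMap<>();

    /** Divides the full 32-bit hash space evenly among the initial shards. */
    HashRangeRouter(String... shards) {
        long span = (1L << 32) / shards.length;
        for (int i = 0; i < shards.length; i++) {
            shardByRangeStart.put((int) (Integer.MIN_VALUE + i * span), shards[i]);
        }
    }

    /** Placeholder hash (assumption); a real system would pick its own function. */
    static int hash(String docId) {
        return docId.hashCode();
    }

    String shardFor(String docId) {
        return shardForHash(hash(docId));
    }

    /** The owner is the shard with the greatest range start <= hash. */
    String shardForHash(int hash) {
        return shardByRangeStart.floorEntry(hash).getValue();
    }

    /**
     * Splits the range beginning at rangeStart in half. The lower half stays
     * with its current shard; the upper half is assigned to newShard. Documents
     * already indexed in the upper half would still have to be moved, which is
     * the job an index splitter would do.
     */
    void split(int rangeStart, String newShard) {
        if (!shardByRangeStart.containsKey(rangeStart)) {
            throw new IllegalArgumentException("no range starts at " + rangeStart);
        }
        Integer next = shardByRangeStart.higherKey(rangeStart);
        long end = (next == null) ? (1L << 31) : next; // exclusive upper bound
        if (end - rangeStart < 2) {
            throw new IllegalStateException("range too small to split");
        }
        long mid = rangeStart + (end - rangeStart) / 2;
        shardByRangeStart.put((int) mid, newShard);
    }
}
```

With two starting shards, calling `split(0, "shard3")` changes only the range that began at 0; hashes in the other shard's range keep routing exactly as before, so only one index needs to be split.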
On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller <markrmil...@gmail.com> wrote:
>
> On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:
>
>> I am not familiar with the index splitter that is in contrib, but I'll
>> take a look at it soon. So the process sounds like it would be to run
>> this on all of the current shards' indexes based on the hash algorithm.
>
> Not something I've thought deeply about myself yet, but I think the idea
> would be to split as many as you felt you needed to.
>
> If you wanted to keep the full balance always, this would mean splitting
> every shard at once, yes. But this depends on how many boxes (partitions) you
> are willing/able to add at a time.
>
> You might just split one index to start - now its hash range would be
> handled by two shards instead of one (if you have 3 replicas per shard, this
> would mean adding 3 more boxes). When you needed to expand again, you would
> split another index that was still handling its full starting range. As you
> grow, once you split every original index, you'd start again, splitting one
> of the now half ranges.
>
>> Is there also an index merger in contrib which could be used to merge
>> indexes? I'm assuming this would be the process?
>
> You can merge with IndexWriter.addIndexes (Solr also has an admin command
> that can do this). But I'm not sure where this fits in?
>
> - Mark
>
>>
>> On Thu, Dec 1, 2011 at 7:18 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>> Not yet - we don't plan on working on this until a lot of other stuff is
>>> working solid at this point. But someone else could jump in!
>>>
>>> There are a couple ways to go about it that I know of:
>>>
>>> A more long term solution may be to start using micro shards - each index
>>> starts as multiple indexes. This makes it pretty fast to move micro shards
>>> around as you decide to change partitions. It's also less flexible as you
>>> are limited by the number of micro shards you start with.
>>>
>>> A simpler and likely first step is to use an index splitter. We
>>> already have one in lucene contrib - we would just need to modify it so
>>> that it splits based on the hash of the document id. This is super
>>> flexible, but splitting will obviously take a little while on a huge index.
>>> The current index splitter is a multi pass splitter - good enough to start
>>> with, but with most files under codec control these days, we may be able to
>>> make a single pass splitter soon as well.
>>>
>>> Eventually you could imagine using both options - micro shards that could
>>> also be split as needed. Though I still wonder if micro shards will be
>>> worth the extra complications myself...
>>>
>>> Right now though, the idea is that you should pick a good number of
>>> partitions to start given your expected data ;) Adding more replicas is
>>> trivial though.
>>>
>>> - Mark
>>>
>>> On Thu, Dec 1, 2011 at 6:35 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>
>>>> Another question, is there any support for repartitioning of the index
>>>> if a new shard is added? What is the recommended approach for
>>>> handling this? It seemed that the hashing algorithm (and probably
>>>> any) would require the index to be repartitioned should a new shard be
>>>> added.
>>>>
>>>> On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>>> Thanks I will try this first thing in the morning.
>>>>>
>>>>> On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>>>>> On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>>>>
>>>>>>> I am currently looking at the latest solrcloud branch and was
>>>>>>> wondering if there was any documentation on configuring the
>>>>>>> DistributedUpdateProcessor? What specifically in solrconfig.xml needs
>>>>>>> to be added/modified to make distributed indexing work?
>>>>>>>
>>>>>>
>>>>>> Hi Jamie - take a look at solrconfig-distrib-update.xml in
>>>>>> solr/core/src/test-files
>>>>>>
>>>>>> You need to enable the update log, add an empty replication handler def,
>>>>>> and an update chain with solr.DistributedUpdateProcessorFactory in it.
>>>>>>
>>>>>> --
>>>>>> - Mark
>>>>>>
>>>>>> http://www.lucidimagination.com
>>>>>>
>>>
>>> --
>>> - Mark
>>>
>>> http://www.lucidimagination.com
>>>
>
> - Mark Miller
> lucidimagination.com