Your mileage will vary. I find that exactly this motion, segregating the documents by the distribution function in the final m/r pass, is what makes reduce-side indexing so attractive: it allows precise control over sharding.
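For anyone following along, the "precise control" comes down to owning the partition function: the same hash that routes a document to a reducer decides which shard it lands in, so shard assignment is deterministic and reproducible. A minimal sketch of such a partitioner, assuming documents keyed by id as Text (class and field names are my own invention, not anything from this thread):

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // One reducer per shard: the partition function IS the shard function.
    // The search tier must apply the same hash when routing queries by id.
    public class ShardPartitioner extends Partitioner<Text, Text> {
      @Override
      public int getPartition(Text docId, Text doc, int numShards) {
        // Clear the sign bit so the modulo always yields a valid partition.
        return (docId.hashCode() & Integer.MAX_VALUE) % numShards;
      }
    }

Wire it in with job.setPartitionerClass(ShardPartitioner.class) and run the job with as many reducers as you want shards.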
On Thu, Oct 11, 2012 at 10:20 PM, Lance Norskog <[email protected]> wrote:

> The problem is finding the documents in the search cluster. Making the
> indexes in the reducer means the final m/r pass has to segregate the
> documents according to the distribution function. This uses network
> bandwidth for Hadoop coordination, while using SolrCloud only pushes
> each document once. It's a disk-bandwidth vs. network-bandwidth
> problem. A combiner in the final m/r pass would really speed up the
> Hadoop shuffle, though.
>
> Lance
>
> From: "Ted Dunning" <[email protected]>
> To: [email protected]
> Sent: Wednesday, October 10, 2012 11:13:36 PM
> Subject: Re: Hadoop/Lucene + Solr architecture suggestions?
>
> Having SolrCloud do the indexing instead of having map-reduce do the
> indexing causes substantial write amplification.
>
> The issue is that cores are replicated in SolrCloud. To keep the cores
> in sync, all of the replicas index the same documents, which leads to
> amplification. Further amplification occurs when all of the logs get a
> copy of the document as well.
>
> Indexing in map-reduce has considerably higher latency, but it drops
> the write amplification dramatically because only one copy of the
> index needs to be created. On MapR, this index is created directly in
> the dfs; in vanilla Hadoop you would typically create the index on
> local disk and copy it to HDFS. The major expense, however, is the
> indexing itself, which it is nice to do only once. Reduce-side
> indexing has the advantage that your indexing bandwidth naturally
> increases with your cluster size.
>
> Deployment of indexes can be done by copying from HDFS, or by
> deploying directly over NFS from MapR. Either way, all of the shard
> replicas appear under the live Solr. Obviously, if you copy from HDFS,
> you have some issues with making sure that things appear correctly.
> One way to deal with this is to double the copy steps: copy from HDFS,
> then do a differential copy into place so that as many files as
> possible are left unchanged. You want that so that as much of the
> memory image as possible stays live. With NFS and transactional
> updates, that isn't an issue, of course.
>
> On the extreme side, you can host all of the searchers on NFS-hosted
> index shards, with one Solr instance devoted to indexing each shard.
> The shards will update in place, and each searching Solr instance will
> detect the changes after a few seconds. This gives near-real-time
> search with high isolation between indexing and search loads.
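To make the "index once, copy once" point above concrete, here is a rough sketch of a reducer that builds its shard with Lucene on local disk and ships it to HDFS in a single copy during cleanup(). It is written against current Lucene and Hadoop APIs rather than the ones from this thread's era, and the paths and field names are invented:

    import java.io.File;
    import java.io.IOException;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    // Each reducer builds one complete shard on local disk, then ships it
    // to HDFS in one pass -- only a single copy of the index is created.
    public class ShardIndexReducer extends Reducer<Text, Text, Text, Text> {
      private IndexWriter writer;
      private File localDir;
      private int shard;

      @Override
      protected void setup(Context ctx) throws IOException {
        shard = ctx.getTaskAttemptID().getTaskID().getId();
        localDir = new File("/tmp/shard-" + shard);
        writer = new IndexWriter(FSDirectory.open(localDir.toPath()),
            new IndexWriterConfig(new StandardAnalyzer()));
      }

      @Override
      protected void reduce(Text docId, Iterable<Text> bodies, Context ctx)
          throws IOException {
        for (Text body : bodies) {
          Document doc = new Document();
          doc.add(new StringField("id", docId.toString(), Field.Store.YES));
          doc.add(new TextField("body", body.toString(), Field.Store.NO));
          writer.addDocument(doc);
        }
      }

      @Override
      protected void cleanup(Context ctx) throws IOException {
        writer.close();  // flush and fsync the finished shard
        FileSystem fs = FileSystem.get(ctx.getConfiguration());
        fs.copyFromLocalFile(new Path(localDir.getAbsolutePath()),
            new Path("/indexes/" + ctx.getJobID() + "/shard-" + shard));
      }
    }

The differential-copy trick on the deployment side is then just rsync (or similar) from the HDFS-fetched copy into the live index directory, so unchanged segment files are not rewritten and their pages stay warm in cache.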
> On Wed, Oct 10, 2012 at 10:38 PM, JAY <[email protected]> wrote:
>
> > How can you store Solr shards in Hadoop? Is each data node running a
> > Solr server? If so, is the reducer doing a trick to write to a local
> > fs?
> >
> > Sent from my iPad
> >
> > On Oct 11, 2012, at 12:04 AM, "M. C. Srivas" <[email protected]> wrote:
> >
> > > Interestingly, a few MapR customers have gone the other way,
> > > deliberately having the indexer put the Solr shards directly into
> > > MapR and letting it distribute them. That has made index
> > > management a cinch.
> > >
> > > Otherwise they do run into what Tim alludes to.
> > >
> > > On Wed, Oct 10, 2012 at 7:27 PM, Tim Williams <[email protected]> wrote:
> > >
> > > > On Wed, Oct 10, 2012 at 10:15 PM, Lance Norskog <[email protected]> wrote:
> > > >
> > > > > In the LucidWorks Big Data product, we handle this with a
> > > > > reducer that sends documents to a SolrCloud cluster. This way
> > > > > the index files are not managed by Hadoop.
> > > >
> > > > Hi Lance,
> > > > I'm curious if you've gotten that to work with a decent-sized
> > > > (e.g. 250 node) cluster? Even a trivial cluster seems to crush
> > > > SolrCloud, as of a few months ago at least...
> > > >
> > > > Thanks,
> > > > --tim
> > > >
> > > > > ----- Original Message -----
> > > > > | From: "Ted Dunning" <[email protected]>
> > > > > | To: [email protected]
> > > > > | Cc: "Hadoop User" <[email protected]>
> > > > > | Sent: Wednesday, October 10, 2012 7:58:57 AM
> > > > > | Subject: Re: Hadoop/Lucene + Solr architecture suggestions?
> > > > > |
> > > > > | I prefer to create indexes in the reducer, personally.
> > > > > |
> > > > > | Also, you can avoid the copies if you use an advanced
> > > > > | Hadoop-derived distro. Email me off-list for details.
> > > > > |
> > > > > | Sent from my iPhone
> > > > > |
> > > > > | On Oct 9, 2012, at 7:47 PM, Mark Kerzner <[email protected]> wrote:
> > > > > |
> > > > > | > Hi,
> > > > > | >
> > > > > | > If I create a Lucene index in each mapper, locally, then
> > > > > | > copy the indexes to /jobid/mapid1, /jobid/mapid2, and so
> > > > > | > on, and then in the reducers copy them to some Solr
> > > > > | > machine (perhaps even merging), does such an architecture
> > > > > | > make sense for creating a searchable index with Hadoop?
> > > > > | >
> > > > > | > Are there links for similar architectures and questions?
> > > > > | >
> > > > > | > Thank you. Sincerely,
> > > > > | > Mark
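And to close the loop on Mark's original question: the per-mapper-index-then-merge plan does work, and the merge step is cheap because Lucene's addIndexes() copies segments wholesale without re-analyzing any documents. A hedged sketch of that final merge, assuming the per-mapper indexes from /jobid/mapidN have already been pulled down to local directories passed as arguments (class name invented):

    import java.nio.file.Paths;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    // Merge per-mapper indexes into one searchable index. args[0] is the
    // destination; args[1..] are the mapper index directories.
    public class IndexMerger {
      public static void main(String[] args) throws Exception {
        try (IndexWriter writer = new IndexWriter(
            FSDirectory.open(Paths.get(args[0])),
            new IndexWriterConfig(new StandardAnalyzer()))) {
          Directory[] parts = new Directory[args.length - 1];
          for (int i = 1; i < args.length; i++) {
            parts[i - 1] = FSDirectory.open(Paths.get(args[i]));
          }
          writer.addIndexes(parts);  // segment copy, no re-analysis
          writer.forceMerge(1);      // optional: collapse to one segment
        }
      }
    }

Whether you want the forceMerge depends on shard size; it rewrites everything once more, but queries against a single segment are somewhat faster.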
