Bin:

The MRIT/Morphlines approach only makes sense if you have lots more
nodes devoted to the M/R jobs than you do Solr shards, since the
actual work done to index a given doc is exactly the same whether you
use MRIT/Morphlines or send straight to Solr.
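To be concrete, "sending straight to Solr" means the SolrJ pattern
from <2>: batch up SolrInputDocuments and push them through
CloudSolrServer (CloudSolrClient in SolrJ 5+). A minimal sketch,
untested, with the ZK ensemble, collection name, and field names all
made up:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BulkIndex {
      public static void main(String[] args) throws Exception {
        // Point at your ZooKeeper ensemble, not individual Solr nodes.
        CloudSolrServer solrServer =
            new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        solrServer.setDefaultCollection("mycollection");

        List<SolrInputDocument> list = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 1000000; i++) { // stand-in for reading rows from Hive
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", Integer.toString(i));
          doc.addField("text_t", "row " + i); // made-up field
          list.add(doc);
          if (list.size() == 1000) {     // batch in groups of 1,000
            solrServer.add(list, 10000); // commitWithin 10s, as in <2>
            list.clear();
          }
        }
        if (!list.isEmpty()) solrServer.add(list, 10000);
        solrServer.commit();
        solrServer.shutdown();
      }
    }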
A bit of background here. I mentioned that MRIT/Morphlines uses
EmbeddedSolrServer. This is exactly Solr as far as the actual indexing
is concerned. So using --go-live is not buying you anything and, in
fact, is costing you quite a bit over just using <2> to index directly
to Solr, since the index has to be copied around. I confess I'm
surprised that --go-live is taking that long; basically it's just
copying your index up to Solr, so perhaps there's an I/O problem or
some such.

OK, I'm lying a little bit here: _if_ you have more than one replica
per shard, then indexing straight to Solr will cost you (anecdotally)
10-15% in indexing speed. But if this is a single replica per shard
(i.e. leader-only), then it's near enough to exactly the same.

Anyway, at the end of the day, the index produced is self-contained.
You could even just copy it to your shards (with Solr down) and then
bring up your Solr nodes on a non-HDFS-based Solr. But frankly I'd
avoid that and benchmark <2> first. My expectation is that you'll be
fine there and see indexing roughly on par with your MRIT/Morphlines.

Now, all that said, indexing 300M docs in 'a few minutes' is a bit
surprising. I'm really wondering if you're not being fooled by
something "odd". Have you compared identical runs with and without
--go-live? _Very_ often the bottleneck isn't Solr at all, it's the
data acquisition, so be careful when measuring that the Solr CPUs are
pegged... otherwise you're bottlenecking upstream of Solr. A
super-simple way to figure that out is to comment out the
solrServer.add(list, 10000) line in <2>, or just run MRIT/Morphlines
without the --go-live switch.

BTW, with <2> you could run as many jobs as you want to drive the
Solr servers flat-out.
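Something along these lines would do it - completely untested, the
thread count and batch size are knobs to turn, and the hosts and
fields are made up as before:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class ParallelIndex {
      public static void main(String[] args) throws Exception {
        final CloudSolrServer solr =
            new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        solr.setDefaultCollection("mycollection");

        // SolrJ clients are thread-safe, so all threads share one instance.
        final int nThreads = 8; // raise this until the Solr CPUs are pegged
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        for (int t = 0; t < nThreads; t++) {
          final int slice = t;
          pool.submit(new Runnable() {
            public void run() {
              try {
                List<SolrInputDocument> batch =
                    new ArrayList<SolrInputDocument>();
                // Each thread takes every nThreads-th row; stand-in for
                // however you actually partition the Hive extract.
                for (int i = slice; i < 1000000; i += nThreads) {
                  SolrInputDocument doc = new SolrInputDocument();
                  doc.addField("id", Integer.toString(i));
                  batch.add(doc);
                  if (batch.size() == 1000) {
                    solr.add(batch, 10000);
                    batch.clear();
                  }
                }
                if (!batch.isEmpty()) solr.add(batch, 10000);
              } catch (Exception e) {
                e.printStackTrace();
              }
            }
          });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
        solr.commit();
        solr.shutdown();
      }
    }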
FWIW,
Erick

On Mon, Mar 7, 2016 at 1:14 PM, Bin Wang <binwang...@gmail.com> wrote:
> Hi Erick,
>
> Thanks for your quick response.
>
> From the data's perspective, we have 300+ million rows and, believe
> it or not, the source data comes from a relational database (Hive).
> The database is rebuilt every day (I am as frustrated as most of you
> who read this, but it is what it is), and we potentially need to
> store all of the fields. In this case, I have to figure out a way to
> index 300+ million rows as fast as I can.
>
> I am still at the stage of evaluating all the different solutions,
> and I am sorry that I haven't really benchmarked the second approach
> yet. I will find time to run some benchmarks and share the results
> with the community.
>
> Regarding the approach I suggested - MapReducing Lucene indexes - do
> you think it is feasible, and is it worth the effort to dive into?
>
> Best regards,
>
> Bin
>
>
> On Mon, Mar 7, 2016 at 1:57 PM, Erick Erickson
> <erickerick...@gmail.com> wrote:
>
>> I'm wondering if you need MapReduce at all ;)...
>>
>> The Achilles' heel of M/R vis-a-vis Solr is all the copying around
>> that's done at the end of the cycle. For really large bulk indexing
>> jobs, that's a reasonable price to pay...
>>
>> How many docs, and how would you characterize them as far as size,
>> fields, etc.? And what are your time requirements? What kind of
>> docs?
>>
>> I'm thinking this may be an "XY Problem". You're asking about a
>> specific solution before explaining the problem.
>>
>> Why do you say that Solr is not really optimized for bulk loading?
>> I took a quick look at <2> and the approach is sound. It batches up
>> the docs in groups of 1,000 and uses CloudSolrServer as it should.
>> Have you tried it? At the end of the day, MapReduceIndexerTool does
>> the same work to index a doc as a regular Solr server would, via
>> EmbeddedSolrServer, so if the number of tasks you have running is
>> roughly equal to the number of shards, it _should_ be roughly
>> comparable.
>>
>> Still, though, I have to repeat my question about how many docs
>> you're talking about here. Using M/R inevitably adds complexity;
>> what are you trying to gain here that you can't get with several
>> threads in a SolrJ client?
>>
>> Best,
>> Erick
>>
>> On Mon, Mar 7, 2016 at 12:28 PM, Bin Wang <binwang...@gmail.com> wrote:
>> > Hi there,
>> >
>> > I have a fairly big data set that I need to quickly index into
>> > SolrCloud.
>> >
>> > I have done some research, and none of the options looked really
>> > good to me.
>> >
>> > (1) Kite Morphlines: I managed to get it working. The MapReduce
>> > part finished in a few minutes, which is good; however, it took a
>> > really long time, like one hour (for 60 million docs), to merge
>> > the indexes into SolrCloud - the go-live part.
>> >
>> > (2) MapReduce using the SolrCloud server:
>> > <http://techuserhadoop.blogspot.com/2014/09/mapreduce-job-for-indexing-documents-to.html>
>> > This approach is pretty straightforward; however, every document
>> > has to funnel through the Solr server, which is really not
>> > optimized for bulk loading.
>> >
>> > Here is what I am thinking: is it possible to use MapReduce to
>> > create a few Lucene indexes first, for example using three
>> > reducers to write three indexes, and then create a Solr collection
>> > with three shards pointing to the generated indexes? Can Solr
>> > easily pick up the generated indexes?
>> >
>> > I am really new to Solr and am wondering if this is feasible, and
>> > whether there is any work that has already been done. I am not
>> > really interested in cutting-edge work, and any existing work
>> > would be appreciated!
>> >
>> > Best regards,
>> >
>> > Bin
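For reference, the raw-Lucene variant Bin asks about - each reducer
writing one shard's index directly with IndexWriter - would look
roughly like the sketch below. Everything here (path, fields,
analyzer) is illustrative, and the catch is that the index has to be
built with exactly the field types and analysis chain the target
collection's schema expects, which is the work EmbeddedSolrServer does
for you inside MRIT:

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    public class ReducerIndexSketch {
      public static void main(String[] args) throws Exception {
        // One reducer writes one shard's index; the path is a placeholder
        // for wherever the reducer's output lands.
        FSDirectory dir = FSDirectory.open(Paths.get("/tmp/shard1"));
        // The analyzer and field types below must reproduce what the
        // collection's schema specifies, or Solr will misread the index.
        IndexWriter writer =
            new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));

        Document doc = new Document(); // stand-in for one Hive row
        doc.add(new StringField("id", "1", Field.Store.YES));
        doc.add(new TextField("text_t", "row contents", Field.Store.YES));
        writer.addDocument(doc);

        writer.close(); // commits; the directory is now a self-contained index
      }
    }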