Bin:

The MRIT/Morphlines approach only makes sense if you have lots more
nodes devoted to the M/R jobs than you do Solr shards, since the
actual work done to index a given doc is exactly the same whether you
use MRIT/Morphlines or send straight to Solr.
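To be concrete, "sending straight to Solr" means the SolrJ pattern
from <2>: batch up SolrInputDocuments and push them through
CloudSolrServer (CloudSolrClient in SolrJ 5+). A minimal sketch,
untested, with the ZK ensemble, collection name, and field names all
made up:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BulkIndex {
      public static void main(String[] args) throws Exception {
        // Point at your ZooKeeper ensemble, not individual Solr nodes.
        CloudSolrServer solrServer =
            new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        solrServer.setDefaultCollection("mycollection");

        List<SolrInputDocument> list = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 1000000; i++) { // stand-in for reading rows from Hive
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", Integer.toString(i));
          doc.addField("text_t", "row " + i); // made-up field
          list.add(doc);
          if (list.size() == 1000) {     // batch in groups of 1,000
            solrServer.add(list, 10000); // commitWithin 10s, as in <2>
            list.clear();
          }
        }
        if (!list.isEmpty()) solrServer.add(list, 10000);
        solrServer.commit();
        solrServer.shutdown();
      }
    }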
A bit of background here. I mentioned that MRIT/Morphlines uses
EmbeddedSolrServer. This is exactly Solr as far as the actual indexing
is concerned. So using --go-live is not buying you anything and, in
fact, is costing you quite a bit over just using <2> to index directly
to Solr, since the index has to be copied around. I confess I'm
surprised that --go-live is taking that long; basically it's just
copying your index up to Solr, so perhaps there's an I/O problem or
some such.

OK, I'm lying a little bit here: _if_ you have more than one replica
per shard, then indexing straight to Solr will cost you (anecdotally)
10-15% in indexing speed. But if this is a single replica per shard
(i.e. leader-only), then it's near enough to exactly the same.

Anyway, at the end of the day, the index produced is self-contained.
You could even just copy it to your shards (with Solr down) and then
bring up your Solr nodes on a non-HDFS-based Solr. But frankly I'd
avoid that and benchmark <2> first. My expectation is that you'll be
fine there and see indexing roughly on par with your MRIT/Morphlines.

Now, all that said, indexing 300M docs in 'a few minutes' is a bit
surprising. I'm really wondering if you're not being fooled by
something "odd". Have you compared identical runs with and without
--go-live? _Very_ often the bottleneck isn't Solr at all, it's the
data acquisition, so be careful when measuring that the Solr CPUs are
pegged... otherwise you're bottlenecking upstream of Solr. A
super-simple way to figure that out is to comment out the
solrServer.add(list, 10000) line in <2>, or just run MRIT/Morphlines
without the --go-live switch.

BTW, with <2> you could run as many jobs as you want to drive the
Solr servers flat-out.
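Something along these lines would do it - completely untested, the
thread count and batch size are knobs to turn, and the hosts and
fields are made up as before:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class ParallelIndex {
      public static void main(String[] args) throws Exception {
        final CloudSolrServer solr =
            new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        solr.setDefaultCollection("mycollection");

        // SolrJ clients are thread-safe, so all threads share one instance.
        final int nThreads = 8; // raise this until the Solr CPUs are pegged
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        for (int t = 0; t < nThreads; t++) {
          final int slice = t;
          pool.submit(new Runnable() {
            public void run() {
              try {
                List<SolrInputDocument> batch =
                    new ArrayList<SolrInputDocument>();
                // Each thread takes every nThreads-th row; stand-in for
                // however you actually partition the Hive extract.
                for (int i = slice; i < 1000000; i += nThreads) {
                  SolrInputDocument doc = new SolrInputDocument();
                  doc.addField("id", Integer.toString(i));
                  batch.add(doc);
                  if (batch.size() == 1000) {
                    solr.add(batch, 10000);
                    batch.clear();
                  }
                }
                if (!batch.isEmpty()) solr.add(batch, 10000);
              } catch (Exception e) {
                e.printStackTrace();
              }
            }
          });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
        solr.commit();
        solr.shutdown();
      }
    }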
FWIW,
Erick

On Mon, Mar 7, 2016 at 1:14 PM, Bin Wang <binwang...@gmail.com> wrote:
> Hi Erick,
>
> Thanks for your quick response.
>
> From the data's perspective, we have 300+ million rows and, believe
> it or not, the source data comes from a relational database (Hive).
> The database is rebuilt every day (I am as frustrated as most of you
> who read this, but it is what it is), and we potentially need to
> store all of the fields. In this case, I have to figure out a way to
> index 300+ million rows as fast as I can.
>
> I am still at the stage of evaluating all the different solutions,
> and I am sorry that I haven't really benchmarked the second approach
> yet. I will find time to run some benchmarks and share the results
> with the community.
>
> Regarding the approach I suggested - MapReducing Lucene indexes - do
> you think it is feasible, and is it worth the effort to dive into?
>
> Best regards,
>
> Bin
>
>
> On Mon, Mar 7, 2016 at 1:57 PM, Erick Erickson
> <erickerick...@gmail.com> wrote:
>
>> I'm wondering if you need MapReduce at all ;)...
>>
>> The Achilles' heel of M/R vis-a-vis Solr is all the copying around
>> that's done at the end of the cycle. For really large bulk indexing
>> jobs, that's a reasonable price to pay...
>>
>> How many docs, and how would you characterize them as far as size,
>> fields, etc.? And what are your time requirements? What kind of
>> docs?
>>
>> I'm thinking this may be an "XY Problem". You're asking about a
>> specific solution before explaining the problem.
>>
>> Why do you say that Solr is not really optimized for bulk loading?
>> I took a quick look at <2> and the approach is sound. It batches up
>> the docs in groups of 1,000 and uses CloudSolrServer as it should.
>> Have you tried it? At the end of the day, MapReduceIndexerTool does
>> the same work to index a doc as a regular Solr server would, via
>> EmbeddedSolrServer, so if the number of tasks you have running is
>> roughly equal to the number of shards, it _should_ be roughly
>> comparable.
>>
>> Still, though, I have to repeat my question about how many docs
>> you're talking about here. Using M/R inevitably adds complexity;
>> what are you trying to gain here that you can't get with several
>> threads in a SolrJ client?
>>
>> Best,
>> Erick
>>
>> On Mon, Mar 7, 2016 at 12:28 PM, Bin Wang <binwang...@gmail.com> wrote:
>> > Hi there,
>> >
>> > I have a fairly big data set that I need to quickly index into
>> > SolrCloud.
>> >
>> > I have done some research, and none of the options looked really
>> > good to me.
>> >
>> > (1) Kite Morphlines: I managed to get it working. The MapReduce
>> > part finished in a few minutes, which is good; however, it took a
>> > really long time, like one hour (for 60 million docs), to merge
>> > the indexes into SolrCloud - the go-live part.
>> >
>> > (2) MapReduce using the SolrCloud server:
>> > <http://techuserhadoop.blogspot.com/2014/09/mapreduce-job-for-indexing-documents-to.html>
>> > This approach is pretty straightforward; however, every document
>> > has to funnel through the Solr server, which is really not
>> > optimized for bulk loading.
>> >
>> > Here is what I am thinking: is it possible to use MapReduce to
>> > create a few Lucene indexes first, for example using three
>> > reducers to write three indexes, and then create a Solr collection
>> > with three shards pointing to the generated indexes? Can Solr
>> > easily pick up the generated indexes?
>> >
>> > I am really new to Solr and am wondering if this is feasible, and
>> > whether there is any work that has already been done. I am not
>> > really interested in cutting-edge work, and any existing work
>> > would be appreciated!
>> >
>> > Best regards,
>> >
>> > Bin
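For reference, the raw-Lucene variant Bin asks about - each reducer
writing one shard's index directly with IndexWriter - would look
roughly like the sketch below. Everything here (path, fields,
analyzer) is illustrative, and the catch is that the index has to be
built with exactly the field types and analysis chain the target
collection's schema expects, which is the work EmbeddedSolrServer does
for you inside MRIT:

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    public class ReducerIndexSketch {
      public static void main(String[] args) throws Exception {
        // One reducer writes one shard's index; the path is a placeholder
        // for wherever the reducer's output lands.
        FSDirectory dir = FSDirectory.open(Paths.get("/tmp/shard1"));
        // The analyzer and field types below must reproduce what the
        // collection's schema specifies, or Solr will misread the index.
        IndexWriter writer =
            new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));

        Document doc = new Document(); // stand-in for one Hive row
        doc.add(new StringField("id", "1", Field.Store.YES));
        doc.add(new TextField("text_t", "row contents", Field.Store.YES));
        writer.addDocument(doc);

        writer.close(); // commits; the directory is now a self-contained index
      }
    }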