Hi Erick,

Thanks for your quick response.

From the data's perspective: we have 300+ million rows, and believe it or
not, the source data comes from a relational database (Hive) that is
rebuilt every day (I am as frustrated as most of you reading this, but it
is what it is), and we potentially need to store all of the fields. Given
that, I have to figure out a way to index those 300+ million rows as fast
as I can.

I am still at the stage of evaluating the different solutions, and I am
sorry that I haven't benchmarked the second approach yet. I will find time
to run some benchmarks and share the results with the community.

Regarding the approach I suggested - building the Lucene indexes in
MapReduce - do you think it is feasible, and is it worth the effort to
dive into?
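
To make it concrete, here is roughly the reducer I have in mind: each
reducer opens a plain Lucene IndexWriter and writes one standalone index,
which would later become one shard. This is only a sketch I typed up and
have not run; the class name, fields, analyzer, and output path are all
placeholders, and I am assuming the Lucene 5.x API:

    import java.io.IOException;
    import java.nio.file.Paths;

    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    public class LuceneIndexReducer
        extends Reducer<Text, Text, NullWritable, NullWritable> {

      private IndexWriter writer;

      @Override
      protected void setup(Context ctx) throws IOException {
        // One standalone index per reducer; it would be copied into the
        // matching shard's data directory after the job finishes.
        int partition = ctx.getTaskAttemptID().getTaskID().getId();
        writer = new IndexWriter(
            FSDirectory.open(Paths.get("/tmp/shard-" + partition)),
            new IndexWriterConfig(new StandardAnalyzer()));
      }

      @Override
      protected void reduce(Text key, Iterable<Text> values, Context ctx)
          throws IOException {
        for (Text value : values) {
          // The fields written here would have to line up with the
          // collection's schema for Solr to serve the index correctly.
          Document doc = new Document();
          doc.add(new StringField("id", key.toString(), Field.Store.YES));
          doc.add(new TextField("text", value.toString(), Field.Store.YES));
          writer.addDocument(doc);
        }
      }

      @Override
      protected void cleanup(Context ctx) throws IOException {
        writer.close(); // commits the segments before the task exits
      }
    }

One thing I am not sure about is routing: the job's partitioner would have
to place documents on reducers the same way Solr routes them to shards, or
the collection would need to use implicit routing.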

Best regards,

Bin



On Mon, Mar 7, 2016 at 1:57 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> I'm wondering if you need map reduce at all ;)...
>
> The Achilles heel of M/R vis-a-vis Solr is all the copying around
> that's done at the end of the cycle. For really large bulk indexing
> jobs, that's a reasonable price to pay.
>
> How many docs are we talking about, and how would you characterize
> them as far as size, fields, etc.? What are your time requirements?
> What kind of docs are they?
>
> I'm thinking this may be an "XY Problem". You're asking about
> a specific solution before explaining the problem.
>
> Why do you say that Solr is not really optimized for bulk loading?
> I took a quick look at <2> and the approach is sound. It batches
> up the docs in groups of 1,000 and uses CloudSolrServer as it should.
> Have you tried it? At the end of the day, MapReduceIndexerTool does
> the same work to index a doc as a regular Solr server would (via
> EmbeddedSolrServer), so if the number of tasks you have running is
> roughly equal to the number of shards, the two _should_ be roughly
> comparable.
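>
> The core of it is just something like this (typed from memory and not
> compiled; the method, Row type, field names, and ZooKeeper address are
> all made up):
>
>     import java.util.ArrayList;
>     import java.util.List;
>
>     import org.apache.solr.client.solrj.impl.CloudSolrServer;
>     import org.apache.solr.common.SolrInputDocument;
>
>     // Row/rows stand in for however you read your source records.
>     void indexAll(Iterable<Row> rows) throws Exception {
>       CloudSolrServer server = new CloudSolrServer("zk1:2181/solr");
>       server.setDefaultCollection("collection1");
>       List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
>       for (Row row : rows) {
>         SolrInputDocument doc = new SolrInputDocument();
>         doc.addField("id", row.getId());
>         doc.addField("text", row.getText());
>         batch.add(doc);
>         if (batch.size() == 1000) { // one request per 1,000 docs
>           server.add(batch);
>           batch.clear();
>         }
>       }
>       if (!batch.isEmpty()) server.add(batch);
>       server.commit(); // commit once at the end, not per batch
>       server.shutdown();
>     }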
>
> Still, though, I have to repeat my question about how many docs you're
> talking about here. Using M/R inevitably adds complexity; what are you
> trying to gain that you can't get with several threads in a SolrJ
> client?
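>
> For the record, SolrJ clients are thread-safe, so the multi-threaded
> version is just the same batching loop fed from a pool, along these
> lines (again a sketch, not tested; the pool size is a guess you'd tune):
>
>     import java.util.List;
>     import java.util.concurrent.ExecutorService;
>     import java.util.concurrent.Executors;
>     import java.util.concurrent.TimeUnit;
>
>     import org.apache.solr.client.solrj.impl.CloudSolrServer;
>     import org.apache.solr.common.SolrInputDocument;
>
>     void indexInParallel(CloudSolrServer server,
>                          List<List<SolrInputDocument>> batches)
>         throws Exception {
>       ExecutorService pool = Executors.newFixedThreadPool(8);
>       for (List<SolrInputDocument> batch : batches) {
>         pool.submit(() -> server.add(batch)); // each task sends one batch
>       }
>       pool.shutdown();
>       pool.awaitTermination(1, TimeUnit.HOURS);
>       server.commit();
>     }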
>
> Best,
> Erick
>
> On Mon, Mar 7, 2016 at 12:28 PM, Bin Wang <binwang...@gmail.com> wrote:
> > Hi there,
> >
> > I have a fairly big data set that I need to index into SolrCloud quickly.
> >
> > I have done some research, and none of the options I found looked really
> > good to me.
> >
> > (1) Kite Morphlines: I managed to get it working, and the MapReduce job
> > finished in a few minutes, which is good. However, the go-live part,
> > merging the indexes into SolrCloud, took a really long time, about one
> > hour for 60 million rows.
> >
> > (2) MapReduce using the SolrCloud server:
> > <http://techuserhadoop.blogspot.com/2014/09/mapreduce-job-for-indexing-documents-to.html>
> > This approach is pretty straightforward; however, every document has to
> > funnel through the Solr server, which is really not optimized for bulk
> > loading.
> >
> > Here is what I am thinking: is it possible to use MapReduce to create a
> > few Lucene indexes first, for example using three reducers to write
> > three indexes, and then create a Solr collection with three shards
> > pointing at the generated indexes? Can Solr easily pick up generated
> > indexes?
> >
> > I am really new to Solr, so I am wondering whether this is feasible and
> > whether any work has already been done on it. I am not really interested
> > in cutting new ground, so any pointers to existing work would be
> > appreciated!
> >
> > Best regards,
> >
> > Bin
>
