RE: Hbase - Solr Integration

Andrew Hu Fri, 30 Sep 2011 16:01:28 -0700

Thanks Drew for your suggestions and ideas, very helpful.

-Andrew


> Date: Fri, 30 Sep 2011 10:17:50 -0400
> Subject: Re: Hbase - Solr Integration
> From: [email protected]
> To: [email protected]
> 
> Hi David,
> 
> I did a little proof of concept a few weeks ago indexing hundreds of
> millions of rows from hbase in solr using the near real time stuff in
> solr's trunk.
> 
> You *could* write map reduce jobs against hbase to generate lucene
> indexes on a periodic basis if you want, but that's not going to be
> real time in the least. If that interests you, take a peek at the
> source code for Katta.
> 
> Like you, I wanted updates to be indexed in near real time. At the
> time of writing, they haven't made a point release of Solr that
> includes the near real time code that came out of twitter. It's been
> merged into trunk and is actually quite stable. Check out trunk,
> compile it, and then configure the near real time stuff. They've
> introduced the concept of 'soft commits' which make new documents
> available to the index in near real time without all the overhead of
> flushing to disk (hard commit). In my case, I set it to automatically
> soft commit once a second and hard commit once an hour.
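
For anyone who wants to try the once-a-second soft commit and hourly hard commit described above, here is a rough sketch of what the solrconfig.xml fragment might look like. It assumes the `autoSoftCommit` and `autoCommit` elements with a `maxTime` in milliseconds, as documented for later Solr releases; a trunk build may differ. The helper name `commit_config` is made up, and Python is used only to render the fragment.

```python
import xml.etree.ElementTree as ET


def commit_config(soft_ms=1000, hard_ms=3600000):
    """Render an <updateHandler> fragment with soft and hard auto-commit intervals."""
    handler = ET.Element("updateHandler", {"class": "solr.DirectUpdateHandler2"})
    # Hard commit: flushes the index to disk (assumed element name: autoCommit).
    hard = ET.SubElement(handler, "autoCommit")
    ET.SubElement(hard, "maxTime").text = str(hard_ms)
    # Soft commit: makes new documents searchable without the flush
    # (assumed element name: autoSoftCommit).
    soft = ET.SubElement(handler, "autoSoftCommit")
    ET.SubElement(soft, "maxTime").text = str(soft_ms)
    return ET.tostring(handler, encoding="unicode")


if __name__ == "__main__":
    print(commit_config())
```

The defaults are simply 1 second and 1 hour expressed in milliseconds.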
> 
> There's nothing hbase specific about my test. I just added some code
> to CC solr on writes I do to hbase using solr's rest api.
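
A minimal sketch of that "CC Solr on every HBase write" idea, not the code from the test itself. The `hbase_put` and `solr_post` callables are hypothetical stand-ins for whatever HBase client and HTTP library you use, and the JSON update endpoint at `/solr/<core>/update` is an assumption that may not match your Solr version.

```python
import json


def solr_add_request(solr_base, core, doc):
    """Build (url, headers, body) for adding one document to a Solr core."""
    url = f"{solr_base}/{core}/update"  # assumed JSON update endpoint
    headers = {"Content-Type": "application/json"}
    body = json.dumps([doc]).encode("utf-8")
    return url, headers, body


def write_row(row_key, columns, hbase_put, solr_post,
              core="core0", solr_base="http://localhost:8983/solr"):
    """Write the row to HBase first, then send the same data to Solr."""
    hbase_put(row_key, columns)  # hypothetical HBase client callable
    doc = {"id": row_key, **columns}
    solr_post(*solr_add_request(solr_base, core, doc))  # hypothetical HTTP poster
    return doc


if __name__ == "__main__":
    puts, posts = [], []
    write_row("row1", {"name": "example"},
              lambda key, cols: puts.append((key, cols)),
              lambda url, headers, body: posts.append((url, body)))
    print(puts)
    print(posts)
```

Note that nothing here keeps the two stores consistent if the Solr post fails after the HBase put succeeds; that part is left to the application.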
> 
> Each document in my test was quite small (<1 KB). I had 1 ec2 large
> instance running solr and a hbase row scanner iterating over a table
> posting documents to solr as fast as it could. When the index was
> small, the indexing speed was a jaw-dropping 3,500 document
> additions/sec. As the index grew to ~50 million it had tapered off to
> 800/sec. The key to keeping things fast is to keep individual indexes
> small. Solr's answer to this is running multiple 'cores'. It's
> basically a rest api for sharding your solr index. Maybe you shard it
> 1 core per customer? When querying you can specify multiple cores to
> execute that query against, run multiple cores on a machine, etc.
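
To make the one-core-per-customer idea concrete, here is a small sketch that names a core per customer and builds a query URL that fans out over several cores using Solr's `shards` request parameter. The core naming scheme, host, and function names are assumptions for illustration only.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def core_for_customer(customer_id):
    """Hypothetical naming scheme: one core per customer."""
    return f"customer_{customer_id}"


def distributed_query_url(host, cores, query):
    """Build a select URL that queries several cores via the 'shards' parameter."""
    shards = ",".join(f"{host}/solr/{core}" for core in cores)
    params = urlencode({"q": query, "shards": shards})
    # Any one of the cores can receive the request and merge the results.
    return f"http://{host}/solr/{cores[0]}/select?{params}"


if __name__ == "__main__":
    cores = [core_for_customer(c) for c in ("a", "b")]
    url = distributed_query_url("localhost:8983", cores, "title:hbase")
    print(url)
    print(parse_qs(urlsplit(url).query)["shards"])
```

Querying a single customer then just means passing a one-element core list.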
> 
> I realize sharding solr to match the scalability of a distributed
> database probably doesn't sound very magical. It's a lot of legwork &
> that's exactly what's motivating projects like Elastic Search &
> Lucandra. I experimented with both and sadly those experiments went
> poorly compared to traditional solr.
> 
> Hope that helps,
> Drew
> 
> On Thu, Sep 29, 2011 at 6:37 PM, Andrew Hu <[email protected]> wrote:
> >
> > Hi David,
> >
> > I am currently working with an HBase table with 100 columns. My
> > requirement is to perform real-time search on HBase using row keys and
> > these columns (all within one column family in the schema). A typical
> > query is SQL-like, with AND, OR, and NOT operators over these columns.
> > I have ruled out batch processing, such as Hive. My question is which
> > of these to choose:
> >
> > - HBase + Solr will probably give better query speed, but you need to
> > maintain both clusters, push data from HBase to Solr, and perhaps
> > update the Solr index pretty frequently.
> > - Using HBase only, with search against all of these columns, you need
> > to build a secondary index for each column (if the master table has
> > 1 million rows, you end up with 100 million index rows plus the
> > 1 million rows of the original master table, which will use quite a
> > lot of space), but I suppose search can be done pretty fast as well?
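
The row-count estimate in the second option works out as below. This is a back-of-the-envelope sketch that assumes one index row per master row per indexed column, with every column populated; the function name is made up.

```python
def secondary_index_rows(master_rows, indexed_columns):
    """Return (index rows, index rows plus master rows)."""
    index_rows = master_rows * indexed_columns
    return index_rows, index_rows + master_rows


index_rows, total_rows = secondary_index_rows(1_000_000, 100)
print(index_rows, total_rows)  # 100000000 101000000
```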
> >
> > I'm not sure which approach is best. Any suggestions?
> >
> >
> > Thanks
> >
> > -Andrew
> >
> >> From: [email protected]
> >> To: [email protected]
> >> Date: Thu, 29 Sep 2011 08:38:12 -0700
> >> Subject: RE: Hbase - Solr Integration
> >>
> >> It sounds like you should investigate the Lily Project.  They have already 
> >> done a lot of work to integrate Solr and HBase into a single solution.  I 
> >> did something similar before they released their project -- I like my use 
> >> of dynamic schemas, but their overall approach is probably more solid.  
> >> In particular they have given careful consideration as to what to do with 
> >> large objects, and how to integrate them into the system.  And most 
> >> importantly, their project is open.
> >>
> >> There was also some talk earlier of integrating HBase and Solr -- you 
> >> might want to search the list for some of Jason's posts.  I think that is 
> >> a work in progress still.
> >>
> >> Otherwise you will have to roll your own solution.  It is actually not too 
> >> difficult to set up a system to publish HBase contents to Solr.  The 
> >> difficulty is in maintaining a consistent view of the data between the 
> >> two.  I believe Lily uses queues to keep updates in sync.  If you can 
> >> tolerate some delay, you could simply update your indexes on a regular 
> >> basis, or set up your application to populate HBase and Solr 
> >> simultaneously.  The biggest challenge is resharding.  HBase will 
> >> automatically split regions when they become too large.  Solr doesn't have 
> >> that capability yet, so you will have to manage the shards yourself.
> >>
> >> Another approach is to look at Elastic Search. That is a Lucene based 
> >> system that does do automatic sharding.
> >>
> >> Direct search on HBase requires either a clever key encoding (like 
> >> OpenTSDB), and/or multiple copies of the data to imitate secondary indexes.
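
As an illustration of the "multiple copies of the data" approach, here is a sketch of a secondary-index row key and a prefix scan over sorted keys. A sorted Python list stands in for an HBase index table, and the separator byte and function names are assumptions, not anything taken from OpenTSDB or HBase itself.

```python
import bisect

SEP = b"\x00"  # assumed separator; must not occur in column names, values, or row keys


def index_key(column, value, row_key):
    """Index row key: column, value, then master row key, so equal values sort together."""
    return SEP.join([column.encode(), value.encode(), row_key.encode()])


def prefix_scan(sorted_keys, prefix):
    """Imitate an HBase prefix scan over lexicographically sorted keys."""
    start = bisect.bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break
        matches.append(key)
    return matches


def lookup(sorted_keys, column, value):
    """Return the master row keys whose column equals value."""
    prefix = SEP.join([column.encode(), value.encode()]) + SEP
    return [key.rsplit(SEP, 1)[1].decode() for key in prefix_scan(sorted_keys, prefix)]


if __name__ == "__main__":
    keys = sorted([
        index_key("city", "paris", "r2"),
        index_key("city", "paris", "r1"),
        index_key("color", "red", "r1"),
    ])
    print(lookup(keys, "city", "paris"))
```

AND/OR/NOT queries would then be set operations over the row keys returned by several such lookups, which is the part Solr otherwise does for you.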
> >>
> >> Dave
> >>
> >>
> >>
> >> -----Original Message-----
> >> From: Stuti Awasthi [mailto:[email protected]]
> >> Sent: Thursday, September 29, 2011 2:52 AM
> >> To: [email protected]
> >> Subject: Hbase - Solr Integration
> >>
> >> Hi Friends,
> >>
> >> I am storing my data in HBase and want to search it using Solr. I can't 
> >> find much documentation about the integration. Is there any documentation 
> >> on integrating these two?
> >>
> >> Please suggest.
> >>
> >> Regards,
> >> Stuti Awasthi
> >>
> >
