Thanks Drew for your suggestions and ideas, very helpful. -Andrew
> Date: Fri, 30 Sep 2011 10:17:50 -0400
> Subject: Re: Hbase - Solr Integration
> From: [email protected]
> To: [email protected]
>
> Hi David,
>
> I did a little proof of concept a few weeks ago indexing hundreds of
> millions of rows from hbase in solr using the near real time stuff in
> solr's trunk.
>
> You *could* write map reduce jobs against hbase to generate lucene
> indexes on a periodic basis if you want, but that's not going to be
> real time in the least. If that interests you, take a peek at the
> source code for Katta.
>
> Like you, I wanted updates to be indexed in near real time. At the
> time of writing, they haven't made a point release of Solr that
> includes the near real time code that came out of twitter. It's been
> merged into trunk and is actually quite stable. Check out trunk,
> compile it, and then configure the near real time stuff. They've
> introduced the concept of 'soft commits', which make new documents
> available to the index in near real time without all the overhead of
> flushing to disk (hard commit). In my case, I set it to automatically
> soft commit once a second and hard commit once an hour.
>
> There's nothing hbase specific about my test. I just added some code
> to CC solr on writes I do to hbase using solr's rest api.
>
> Each document in my test was quite small (<1k). I had 1 ec2 large
> instance running solr and a hbase row scanner iterating over a table
> posting documents to solr as fast as it could. When the index was
> small, the indexing speed was a jaw-dropping 3500 document
> additions/sec. As the index grew to ~50 million it had tapered off to
> 800/sec. The key to keeping things fast is to keep individual indexes
> small. Solr's answer to this is running multiple 'cores'. It's
> basically a rest api for sharding your solr index. Maybe you shard it
> 1 core per customer? When querying you can specify multiple cores to
> execute that query against, run multiple cores on a machine, etc.
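
A minimal sketch of the "1 core per customer" idea above, for illustration
only. It just builds the URLs: one update URL per core for indexing, and one
select URL that fans a query out over several cores using Solr's `shards`
parameter. The host, port, the `customer_<id>` core naming scheme, and all
function names are assumptions, not something taken from Drew's test.

```python
from urllib.parse import urlencode

# Assumed Solr location; adjust to your own deployment.
SOLR_BASE = "http://localhost:8983/solr"


def core_for_customer(customer_id):
    """Map a customer id to a core name (hypothetical naming scheme)."""
    return "customer_" + str(customer_id)


def update_url(customer_id, base=SOLR_BASE):
    """URL to POST documents to for a single customer's core."""
    return base + "/" + core_for_customer(customer_id) + "/update"


def shards_value(customer_ids, base=SOLR_BASE):
    """Comma-separated shard list; entries omit the http:// scheme."""
    host_and_path = base.split("://", 1)[1]
    return ",".join(
        host_and_path + "/" + core_for_customer(c) for c in customer_ids
    )


def select_url(customer_ids, query, base=SOLR_BASE):
    """Query URL that runs one query against several cores at once."""
    first_core = core_for_customer(customer_ids[0])
    params = urlencode({"q": query, "shards": shards_value(customer_ids, base)})
    return base + "/" + first_core + "/select?" + params


if __name__ == "__main__":
    print(update_url("acme"))
    print(select_url(["acme", "globex"], "title:hbase"))
```

Keeping the routing in one small function means the write path (the code
that CCs solr on each hbase write) and the query path agree on which core a
customer's documents live in.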
>
> I realize sharding solr to match the scalability of a distributed
> database probably doesn't sound very magical. It's a lot of legwork &
> that's exactly what's motivating projects like Elastic Search &
> Lucandra. I experimented with both and sadly those experiments went
> poorly compared to traditional solr.
>
> Hope that helps,
> Drew
>
> On Thu, Sep 29, 2011 at 6:37 PM, Andrew Hu <[email protected]> wrote:
> >
> > Hi David,
> >
> > I am currently working with HBase with 100 columns. My requirement is
> > to perform real time search on HBase using rowkeys and these many
> > columns (all within 1 family only in the schema). A typical query can
> > be SQL type with AND, OR, NOT operators using these columns. I have
> > ruled out batch processing, such as Hive. My question is:
> >
> > - HBase + Solr will probably give you better query speed, but you need
> >   to maintain both clusters, push data from HBase to Solr, and perhaps
> >   update the Solr index pretty frequently.
> > - With HBase only, where search needs to cover all of these columns,
> >   you need to build secondary indexes for each column (if the master
> >   table is 1 million rows, you will end up with 100 million index rows
> >   plus the 1 million rows of the original master table, which will use
> >   quite a lot of space), but I suppose search can be done pretty fast
> >   as well?
> >
> > Not sure what the best approach is, any suggestions?
> >
> > Thanks
> >
> > -Andrew
> >
> >> From: [email protected]
> >> To: [email protected]
> >> Date: Thu, 29 Sep 2011 08:38:12 -0700
> >> Subject: RE: Hbase - Solr Integration
> >>
> >> It sounds like you should investigate the Lily Project. They have already
> >> done a lot of work to integrate Solr and HBase into a single solution. I
> >> did something similar before they released their project -- I like my use
> >> of dynamic schemas, but their overall approach is probably more solid.
> >> In particular they have given careful consideration as to what to do with
> >> large objects, and how to integrate them into the system. And most
> >> importantly, their project is open.
> >>
> >> There was also some talk earlier of integrating HBase and Solr -- you
> >> might want to search the list for some of Jason's posts. I think that is
> >> a work in progress still.
> >>
> >> Otherwise you will have to roll your own solution. It is actually not too
> >> difficult to set up a system to publish HBase contents to Solr. The
> >> difficulty is in maintaining a consistent view of the data between the
> >> two. I believe Lily uses queues to keep updates in sync. If you can
> >> tolerate some delay, you could simply update your indexes on a regular
> >> basis, or set up your application to populate HBase and Solr
> >> simultaneously. The biggest challenge is resharding. HBase will
> >> automatically split regions when they become too large. Solr doesn't have
> >> that capability yet, so you will have to manage the shards yourself.
> >>
> >> Another approach is to look at Elastic Search. That is a Lucene based
> >> system that does do automatic sharding.
> >>
> >> Direct search on HBase requires either a clever key encoding (like
> >> OpenTSDB), and/or multiple copies of the data to imitate secondary indexes.
> >>
> >> Dave
> >>
> >>
> >>
> >> -----Original Message-----
> >> From: Stuti Awasthi [mailto:[email protected]]
> >> Sent: Thursday, September 29, 2011 2:52 AM
> >> To: [email protected]
> >> Subject: Hbase - Solr Integration
> >>
> >> Hi Friends,
> >>
> >> I am storing my data in Hbase. I want to do search using Solr. I can't
> >> find much documentation about the integration. Is there any documentation
> >> to integrate these two.
> >>
> >> Please Suggest
> >>
> >> Regards,
> >> Stuti Awasthi
