Here's the simple thing to consider... If you are running M/R jobs against the data... HBase hands down is the winner.
If you are looking at a stand alone cluster ... Cassandra wins. HBase is still a fickle beast.

Of course I just bottom lined it. :-)

On Nov 29, 2012, at 10:51 PM, Lance Norskog <[email protected]> wrote:

> Please! There are lots of blogs etc. about the two, but very few head-to-head
> for a real use case.
>
> From: "anil gupta" <[email protected]>
> To: "[email protected]" <[email protected]>
> Sent: Wednesday, November 28, 2012 11:01:55 AM
> Subject: Re: Best practice for storage of data that changes
>
> Hi Jeff,
>
> At my workplace, Intuit, we did a detailed study to evaluate HBase and
> Cassandra for our use case. I will see if I can post the comparative study on
> my public blog or on this mailing list.
>
> BTW, what is your use case? What bottleneck are you hitting with your current
> solutions? If you can share some details then the HBase community will try to
> help you out.
>
> Thanks,
> Anil Gupta
>
>
> On Wed, Nov 28, 2012 at 9:55 AM, jeff l <[email protected]> wrote:
> Hi,
>
> I have quite a bit of experience with RDBMSs ( Oracle, Postgres, Mysql ) and
> MongoDB but don't feel any are quite right for this problem. The amount of
> data being stored and access requirements just don't match up well.
>
> I was hoping to keep the stack as simple as possible and just use hdfs but
> everything I was seeing kept pointing to the need for some other datastore.
> I'll check out both HBase and Cassandra.
>
> Thanks for the feedback.
>
>
> On Sun, Nov 25, 2012 at 1:11 PM, anil gupta <[email protected]> wrote:
> Hi Jeff,
>
> My two cents below:
>
> 1st use case: Append-only data - e.g. weblogs or user logins
> As others have already mentioned, Hadoop is well suited to storing
> append-only data. If you want to do analysis of weblogs or user logins then
> Hadoop is a suitable solution for it.
>
>
> 2nd use case: Account/User data
> First of all, I would suggest you look at your use case and then
> analyze whether it really needs a NoSql solution or not.
> You were talking about maintaining User Data in NoSql. Why NoSql instead
> of an RDBMS? What is the size of the data? Which NoSql features are the
> selling points for you?
>
> For real time reads and writes you can have a look at Cassandra or HBase. But I
> would suggest you take a very close look at both of them because each of
> them has its own advantages. So, the choice will depend on your use
> case.
>
> One added advantage with HBase is that it has a deeper integration with the
> Hadoop ecosystem, so you can do a lot of stuff on HBase data using Hadoop
> tools. HBase has integration with Hive querying but AFAIK it has some
> limitations.
>
> HTH,
> Anil Gupta
>
>
> On Sun, Nov 25, 2012 at 4:52 AM, Mahesh Balija <[email protected]>
> wrote:
> Hi Jeff,
>
> As the HDFS paradigm is "Write once and read many", you cannot
> update the files on HDFS.
> But for your problem, what you can do is keep the logs/userdata in
> hdfs with different timestamps.
> Run some mapreduce jobs at certain intervals to extract the required data
> from those logs and put it into Hbase/Cassandra/Mongodb.
>
> Mongodb read performance is quite fast, and it also supports ad-hoc
> querying. You can also use the Hadoop-MongoDB connector to read/write the data
> to Mongodb through Hadoop-Mapreduce.
>
> If you are very specific about updating the hdfs files directly then
> you have to use one of the commercial Hadoop packages like MapR, which supports
> updating the HDFS files.
>
> Best,
> Mahesh Balija,
> Calsoft Labs.
>
>
> On Sun, Nov 25, 2012 at 9:40 AM, bharath
> vissapragada <[email protected]> wrote:
> Hi Jeff,
>
> Please look at [1]. You can store your data in HBase tables and query them
> normally just by mapping them to Hive tables. Regarding Cassandra support,
> please follow JIRA [2]; it's not yet in the trunk, I suppose!
>
> [1] https://cwiki.apache.org/Hive/hbaseintegration.html
> [2] https://issues.apache.org/jira/browse/HIVE-1434
>
> Thanks,
>
>
> On Sun, Nov 25, 2012 at 2:26 AM, jeff l <[email protected]> wrote:
> Hi All,
>
> I'm coming from the RDBMS world and am looking at hdfs for long term data
> storage and analysis.
>
> I've done some research and set up some smallish hdfs clusters with hive for
> testing but I'm having a little trouble understanding how everything fits
> together and was hoping someone could point me in the right direction.
>
> I'm looking at storing two types of data:
>
> 1. Append-only data - e.g. weblogs or user logins
> 2. Account/User data
>
> HDFS seems to be perfect for append-only data like #1, but I'm having trouble
> figuring out what to do with data that may change frequently.
>
> A simple example would be user data where various bits of information: email,
> etc may change from day to day. Would hbase or cassandra be the better way
> to go for this type of data, and can I overlay hive over all ( hdfs, hbase,
> cassandra ) so that I can query the data through a single interface?
>
> Thanks in advance for any help.
>
>
> --
> Regards,
> Bharath .V
> w:http://researchweb.iiit.ac.in/~bharath.v
>
>
> --
> Thanks & Regards,
> Anil Gupta
>
>
> --
> Thanks & Regards,
> Anil Gupta
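
For what it's worth, the batch approach suggested in the quoted thread (keep timestamped, append-only records in HDFS, then periodically extract the current state and load it into HBase or Cassandra) might be sketched roughly as below. This is plain Python rather than a real MapReduce job; the record fields and the function name are hypothetical, and the step that writes the result to a datastore is left out.

```python
from typing import Dict, Iterable, NamedTuple


class UserEvent(NamedTuple):
    """One append-only record. Field names are hypothetical."""
    user_id: str
    timestamp: int  # e.g. epoch seconds when the change was logged
    email: str


def latest_per_user(events: Iterable[UserEvent]) -> Dict[str, UserEvent]:
    """Reduce an append-only log to the newest record for each user."""
    latest: Dict[str, UserEvent] = {}
    for event in events:
        current = latest.get(event.user_id)
        # Keep the event only if it is newer than what we have seen so far.
        if current is None or event.timestamp > current.timestamp:
            latest[event.user_id] = event
    return latest


# Nothing is updated in place: a changed email is just a newer record.
log = [
    UserEvent("u1", 100, "a@old.example"),
    UserEvent("u2", 110, "b@example.org"),
    UserEvent("u1", 200, "a@new.example"),
]
current_state = latest_per_user(log)
```

In an actual job the grouping by user would be done by the shuffle and the "keep the newest" logic would sit in the reducer; the resulting per-user rows are what would be loaded into the serving store.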
