Re: Best practice for storage of data that changes

Michael Segel Fri, 30 Nov 2012 12:45:10 -0800

Here's the simple thing to consider... 

If you are running M/R jobs against the data... HBase hands down is the winner.


If you are looking at a standalone cluster ... Cassandra wins. HBase is still 
a fickle beast.

Of course I just bottom lined it.  :-) 


On Nov 29, 2012, at 10:51 PM, Lance Norskog <[email protected]> wrote:

> Please! There are lots of blogs etc. about the two, but very few head-to-head 
> for a real use case.
> 
> From: "anil gupta" <[email protected]>
> To: "[email protected]" <[email protected]>
> Sent: Wednesday, November 28, 2012 11:01:55 AM
> Subject: Re: Best practice for storage of data that changes
> 
> Hi Jeff,
> 
> At my workplace, Intuit, we did a detailed study to evaluate HBase and 
> Cassandra for our use case. I will see if I can post the comparative study on 
> my public blog or on this mailing list.
> 
> BTW, what is your use case? What bottlenecks are you hitting with your 
> current solutions? If you can share some details, the HBase community will 
> try to help you out.
> 
> Thanks,
> Anil Gupta
> 
> 
> On Wed, Nov 28, 2012 at 9:55 AM, jeff l <[email protected]> wrote:
> Hi,
> 
> I have quite a bit of experience with RDBMSs (Oracle, Postgres, MySQL) and 
> MongoDB but don't feel any are quite right for this problem.  The amount of 
> data being stored and the access requirements just don't match up well.
> 
> I was hoping to keep the stack as simple as possible and just use hdfs but 
> everything I was seeing kept pointing to the need for some other datastore.  
> I'll check out both HBase and Cassandra.
> 
> Thanks for the feedback.
> 
> 
> On Sun, Nov 25, 2012 at 1:11 PM, anil gupta <[email protected]> wrote:
> Hi Jeff,
> 
> My two cents below:
> 
> 1st use case: Append-only data - e.g. weblogs or user logins
> As others have already mentioned, Hadoop is well suited to storing 
> append-only data. If you want to analyze weblogs or user logins, Hadoop is a 
> suitable solution.
> 
> 
> 2nd use case: Account/User data
> First of all, I would suggest looking at your use case and analyzing 
> whether it really needs a NoSQL solution. 
> You were talking about maintaining user data in NoSQL. Why NoSQL instead 
> of an RDBMS? What is the size of the data? Which NoSQL features are the 
> selling points for you?
> 
> For real-time reads and writes you can look at Cassandra or HBase. But I 
> would suggest taking a very close look at both, because each has its own 
> advantages, so the choice will depend on your use case.
> 
> One added advantage of HBase is its deeper integration with the Hadoop 
> ecosystem, so you can do a lot with HBase data using Hadoop tools. HBase 
> integrates with Hive querying, but AFAIK it has some limitations.
> 
> HTH,
> Anil Gupta
> 
> 
> On Sun, Nov 25, 2012 at 4:52 AM, Mahesh Balija <[email protected]> wrote:
> Hi Jeff,
> 
>         As the HDFS paradigm is "write once, read many", you cannot update 
> files in place on HDFS.
>         For your problem, you can instead keep the logs/user data in HDFS 
> with different timestamps.
>         Then run MapReduce jobs at certain intervals to extract the required 
> data from those logs and put it into HBase/Cassandra/MongoDB.
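
The periodic extraction described above can be sketched in plain Python. This
is only an in-memory illustration of the map/shuffle/reduce shape, not a real
Hadoop job: the tab-separated record layout, the sample data, and every
function name are assumptions made for the example, and the output dicts just
stand in for rows that would be written to HBase/Cassandra/MongoDB.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical snapshot lines as they might sit in timestamped HDFS files:
# user_id <TAB> ISO-8601 timestamp <TAB> email
SNAPSHOT_LINES = [
    "u1\t2012-11-24T08:00:00\talice@old.example",
    "u2\t2012-11-24T09:30:00\tbob@example.org",
    "u1\t2012-11-25T10:15:00\talice@new.example",
]


def map_record(line):
    """Map step: emit (user_id, (timestamp, email)) for one input line."""
    user_id, timestamp, email = line.rstrip("\n").split("\t")
    return user_id, (timestamp, email)


def reduce_latest(user_id, values):
    """Reduce step: keep only the most recent value seen for a user."""
    # Same-format ISO-8601 strings sort chronologically, so max() picks
    # the newest (timestamp, email) pair.
    timestamp, email = max(values)
    return {"row_key": user_id, "email": email, "updated_at": timestamp}


def latest_per_user(lines):
    """Run map, shuffle (sort and group by key), and reduce in memory."""
    mapped = sorted(map(map_record, lines), key=itemgetter(0))
    return [
        reduce_latest(user_id, [value for _, value in group])
        for user_id, group in groupby(mapped, key=itemgetter(0))
    ]


if __name__ == "__main__":
    for row in latest_per_user(SNAPSHOT_LINES):
        print(row)
```

The append-only files stay untouched; only the derived "latest state per
user" rows are overwritten in the serving store on each run.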
> 
>         MongoDB read performance is quite fast, and it supports ad-hoc 
> querying. You can also use the Hadoop-MongoDB connector to read and write 
> MongoDB data through Hadoop MapReduce.
>      
>         If you specifically need to update files in HDFS directly, then 
> you have to use a commercial Hadoop package such as MapR, which supports 
> updating files in place.
> 
> Best,
> Mahesh Balija,
> Calsoft Labs.
> 
> 
> 
> On Sun, Nov 25, 2012 at 9:40 AM, bharath vissapragada <[email protected]> wrote:
> Hi Jeff,
> 
> Please look at [1]. You can store your data in HBase tables and query them 
> normally just by mapping them to Hive tables. Regarding Cassandra support, 
> please follow JIRA [2]; it's not yet in trunk, I suppose!
> 
> [1] https://cwiki.apache.org/Hive/hbaseintegration.html
> [2] https://issues.apache.org/jira/browse/HIVE-1434
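
As a rough illustration of the mapping described in [1], the helper below
builds the kind of Hive DDL that exposes an existing HBase table through
Hive's HBase storage handler. The helper, the table names, and the column
names are hypothetical; only the storage handler class and the
`hbase.columns.mapping` / `hbase.table.name` properties come from the Hive
HBase integration. Treat it as a sketch to adapt, and check the generated
statement against your Hive version before running it.

```python
def hive_hbase_ddl(hive_table, hbase_table, key_column, columns):
    """Build a CREATE EXTERNAL TABLE statement mapping an HBase table to Hive.

    columns maps a Hive column name to (Hive type, "family:qualifier").
    """
    hive_cols = [f"{key_column} STRING"]
    mapping = [":key"]  # ":key" binds the first Hive column to the row key
    for name, (hive_type, hbase_col) in columns.items():
        hive_cols.append(f"{name} {hive_type}")
        mapping.append(hbase_col)

    column_list = ", ".join(hive_cols)
    mapping_list = ",".join(mapping)  # no spaces inside the mapping string
    return (
        f"CREATE EXTERNAL TABLE {hive_table} ({column_list})\n"
        "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'\n"
        f"WITH SERDEPROPERTIES ('hbase.columns.mapping' = '{mapping_list}')\n"
        f"TBLPROPERTIES ('hbase.table.name' = '{hbase_table}');"
    )


if __name__ == "__main__":
    # Hypothetical user table with one column family named "info".
    print(
        hive_hbase_ddl(
            hive_table="users",
            hbase_table="users_hbase",
            key_column="user_id",
            columns={"email": ("STRING", "info:email")},
        )
    )
```

Once such a table exists, the frequently changing user data in HBase and the
append-only logs in HDFS can both be queried from Hive.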
> 
> Thanks,
> 
> 
> On Sun, Nov 25, 2012 at 2:26 AM, jeff l <[email protected]> wrote:
> Hi All,
> 
> I'm coming from the RDBMS world and am looking at hdfs for long term data 
> storage and analysis.
> 
> I've done some research and set up some smallish hdfs clusters with hive for 
> testing but I'm having a little trouble understanding how everything fits 
> together and was hoping someone could point me in the right direction.
> 
> I'm looking at storing two types of data:
> 
> 1. Append-only data - e.g. weblogs or user logins
> 2. Account/User data
> 
> HDFS seems to be perfect for append-only data like #1, but I'm having trouble 
> figuring out what to do with data that may change frequently.
> 
> A simple example would be user data where various bits of information (email, 
> etc.) may change from day to day.  Would HBase or Cassandra be the better way 
> to go for this type of data, and can I overlay Hive over all of them (HDFS, 
> HBase, Cassandra) so that I can query the data through a single interface?
> 
> Thanks in advance for any help.
> 
> 
> 
> -- 
> Regards,
> Bharath .V
> w:http://researchweb.iiit.ac.in/~bharath.v
> 
> 
> 
> 
> -- 
> Thanks & Regards,
> Anil Gupta
> 
> 
> 
> 
> -- 
> Thanks & Regards,
> Anil Gupta
