Yes and no. It’s a bit more complicated: it depends on your data and on how you’re using it.
I wouldn’t go too thin and I wouldn’t go too fat.

> On Feb 20, 2015, at 2:19 PM, Alok Singh <[email protected]> wrote:
>
> You don't want a lot of columns in a write heavy table. HBase stores
> the "row key" along with each cell/column (Though old, I find this
> still useful:
> http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html)
> Having a lot of columns will amplify the amount of data being stored.
>
> That said, if there are only going to be a handful of alert_ids for a
> given "user_id+timestamp" row key, then you should be ok.
>
> The query "Select * from table where user_id = X and timestamp > T and
> (alert_id = id1 or alert_id = id2)" can be accomplished with either
> design. See QualifierFilter and FuzzyRowFilter docs to get some ideas.
>
> Alok
>
> On Fri, Feb 20, 2015 at 11:21 AM, Marcelo Valle (BLOOMBERG/ LONDON)
> <[email protected]> wrote:
>> Hi Alok,
>>
>> Thanks for the answer. Yes, I have read this section, but it was a little
>> too abstract for me, I think I was needing to check my understanding. Your
>> answer helped me to confirm I am on the right path, thanks for that.
>>
>> One question: if instead of using user_id + timestamp + alert_id I use
>> user_id + timestamp as row key, I would still be able to store alert_id +
>> alert_data in columns, right?
>>
>> I took the idea from the last section of this link:
>> http://www.appfirst.com/blog/best-practices-for-managing-hbase-in-a-high-write-environment/
>>
>> But I wonder which option would be better for my case. It seems column scans
>> are not so fast as row scans, but what would be the advantages of one design
>> over the other?
>>
>> If I use something like:
>> Row key: user_id + timestamp
>> Column prefix: alert_id
>> Column value: json with alert data
>>
>> Would I be able to do a query like the one below?
>> Select * from table where user_id = X and timestamp > T and (alert_id = id1
>> or alert_id = id2)
>>
>> Would I be able to do the same query using user_id + timestamp + alert_id as
>> row key?
>>
>> Also, I know Cassandra supports up to 2 billion columns per row (2 billion
>> rows per partition in CQL), do you know what's the limit for HBase?
>>
>> Best regards,
>> Marcelo Valle.
>>
>> From: [email protected]
>> Subject: Re: data partitioning and data model
>>
>> You can use a key like (user_id + timestamp + alert_id) to get
>> clustering of rows related to a user. To get better write throughput
>> and distribution over the cluster, you could pre-split the table and
>> use a consistent hash of the user_id as a row key prefix.
>>
>> Have you looked at the rowkey design section in the hbase book :
>> http://hbase.apache.org/book.html#rowkey.design
>>
>> Alok
>>
>> On Fri, Feb 20, 2015 at 8:49 AM, Marcelo Valle (BLOOMBERG/ LONDON)
>> <[email protected]> wrote:
>>> Hello,
>>>
>>> This is my first message in this mailing list, I just subscribed.
>>>
>>> I have been using Cassandra for the last few years and now I am trying to
>>> create a POC using HBase. Therefore, I am reading the HBase docs but it's
>>> been really hard to find how HBase behaves in some situations, when
>>> compared to Cassandra. I thought maybe it was a good idea to ask here, as
>>> people in this list might know the differences better than anyone else.
>>>
>>> What I want to do is creating a simple application optimized for writes
>>> (not interested in HBase / Cassandra product comparisons here, I am
>>> assuming I will use HBase and that's it, just wanna understand the best way
>>> of doing it in HBase world). I want to be able to write alerts to the
>>> cluster, where each alert would have columns like:
>>> - alert id
>>> - user id
>>> - date/time
>>> - alert data
>>>
>>> Later, I want to search for alerts per user, so my main query could be
>>> considered to be something like:
>>> Select * from alerts where user_id = $id and date/time > 10 days ago.
>>>
>>> I want to decide the data model for my application.
>>>
>>> Here are my questions:
>>>
>>> - In Cassandra, I would partition by user + day, as some users can have
>>> many alerts and some just 1 or a few. In hbase, assuming all alerts for a
>>> user would always fit in a single partition / region, can I just use
>>> user_id as my row key and assume data will be distributed along the cluster?
>>>
>>> - Suppose I want to write 100 000 rows from a client machine and these are
>>> from 30 000 users. What's the best manner to write these if I want to
>>> optimize for writes? Should I batch all 100 k requests in one to a single
>>> server? As I am trying to optimize for writes, I would like to split these
>>> requests across several nodes instead of sending them all to one. I found
>>> this article:
>>> http://hortonworks.com/blog/apache-hbase-region-splitting-and-merging/ But
>>> not sure if it's what I need
>>>
>>> Thanks in advance!
>>>
>>> Best regards,
>>> Marcelo.
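
For what it's worth, the row key layout discussed in the thread (a hash-derived prefix, then user_id + timestamp + alert_id) could be sketched roughly as below. This is plain Python with no HBase client involved. The integer ID types, the 16-bucket salt, the use of SHA-256, and the names salt_for, row_key and scan_range are all assumptions made for illustration; none of them come from the thread or from the HBase API.

```python
import hashlib
import struct

# Hypothetical number of salt buckets, e.g. one per pre-split region.
NUM_BUCKETS = 16


def salt_for(user_id: int, buckets: int = NUM_BUCKETS) -> int:
    """Derive a stable one-byte prefix from the user id.

    The same user always maps to the same bucket, so all rows for a
    user stay contiguous while different users spread across buckets.
    """
    digest = hashlib.sha256(struct.pack(">Q", user_id)).digest()
    return digest[0] % buckets


def row_key(user_id: int, timestamp_ms: int, alert_id: int) -> bytes:
    """Build salt + user_id + timestamp + alert_id as a byte string.

    Fixed-width big-endian fields make the lexicographic byte order
    match the numeric order of each field (non-negative values only).
    """
    return struct.pack(">BQQQ", salt_for(user_id), user_id, timestamp_ms, alert_id)


def scan_range(user_id: int, after_ms: int) -> tuple:
    """Return (start, stop) row keys for 'user_id = X and timestamp > T'.

    start is inclusive and stop is exclusive, which is the usual
    convention for a row range scan. Overflow at the maximum 64-bit
    value is not handled in this sketch.
    """
    salt = salt_for(user_id)
    start = struct.pack(">BQQ", salt, user_id, after_ms + 1)
    stop = struct.pack(">BQ", salt, user_id + 1)
    return start, stop


if __name__ == "__main__":
    key = row_key(42, 1424460000000, 7)
    start, stop = scan_range(42, 1424400000000)
    print(key.hex())
    print(start <= key < stop)
```

With a layout like this, the "alerts for one user after time T" query becomes a single contiguous range scan inside one salt bucket. The alert_id restriction would still need a filter on top of the scan, as the QualifierFilter and FuzzyRowFilter pointers above suggest.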
