Re: data partitioning and data model

Michael Segel Mon, 23 Feb 2015 16:10:44 -0800

Yes and no. 

It's a bit more complicated than that: it depends on the data and on how 
you're using it.


I wouldn't go too thin and I wouldn't go too fat.
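
To put rough numbers on thin versus fat, here is a back-of-the-envelope sketch in Python. It assumes the classic uncompressed KeyValue layout (no tags, no block encoding), in which every cell carries its own copy of the row key, family, qualifier and timestamp. The 8-byte ids, the one-byte family name "a", the qualifier "d" and the 200-byte JSON payload are made-up sizes for illustration, not figures from this thread.

```python
def keyvalue_size(row: bytes, family: bytes, qualifier: bytes, value: bytes) -> int:
    """Approximate on-disk size of one cell in the classic KeyValue layout."""
    key_len = (
        2 + len(row)        # 2-byte row length + row key
        + 1 + len(family)   # 1-byte family length + family name
        + len(qualifier)    # column qualifier
        + 8                 # timestamp
        + 1                 # key type (Put, Delete, ...)
    )
    return 4 + 4 + key_len + len(value)  # 4-byte key length + 4-byte value length


# Hypothetical field sizes, chosen only for the comparison.
user_id = b"U" * 8
ts = b"T" * 8
alert_id = b"A" * 8
payload = b"x" * 200

# Thin/tall: row key = user_id + timestamp + alert_id, one fixed qualifier.
tall = keyvalue_size(user_id + ts + alert_id, b"a", b"d", payload)

# Fat/wide: row key = user_id + timestamp, qualifier = alert_id.
wide = keyvalue_size(user_id + ts, b"a", alert_id, payload)

print(tall, wide)
```

Under these assumptions both layouts store one cell per alert and repeat the key material in each cell, so the per-alert size comes out nearly the same. That suggests the decision hinges more on the number of cells and on the access pattern than on which side of the key the alert_id sits.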

> On Feb 20, 2015, at 2:19 PM, Alok Singh <[email protected]> wrote:
> 
> You don't want a lot of columns in a write-heavy table. HBase stores
> the "row key" along with each cell/column (though old, I find this
> still useful:
> http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html ).
> Having a lot of columns will amplify the amount of data being stored.
> 
> That said, if there are only going to be a handful of alert_ids for a
> given "user_id+timestamp" row key, then you should be ok.
> 
> The query "Select * from table where user_id = X and timestamp > T and
> (alert_id = id1 or alert_id = id2)" can be accomplished with either
> design. See QualifierFilter and FuzzyRowFilter docs to get some ideas.
> 
> Alok
> 
> On Fri, Feb 20, 2015 at 11:21 AM, Marcelo Valle (BLOOMBERG/ LONDON)
> <[email protected]> wrote:
>> Hi Alok,
>> 
>> Thanks for the answer. Yes, I have read this section, but it was a little 
>> too abstract for me; I think I needed to check my understanding. Your 
>> answer helped me confirm that I am on the right path, thanks for that.
>> 
>> One question: if instead of using user_id + timestamp + alert_id I use 
>> user_id + timestamp as the row key, I would still be able to store 
>> alert_id + alert_data in columns, right?
>> 
>> I took the idea from the last section of this link: 
>> http://www.appfirst.com/blog/best-practices-for-managing-hbase-in-a-high-write-environment/
>> 
>> But I wonder which option would be better for my case. It seems column scans 
>> are not as fast as row scans, but what would be the advantages of one design 
>> over the other?
>> 
>> If I use something like:
>> Row key: user_id + timestamp
>> Column prefix: alert_id
>> Column value: json with alert data
>> 
>> Would I be able to do a query like the one below?
>> Select * from table where user_id = X and timestamp > T and (alert_id = id1 
>> or alert_id = id2)
>> 
>> Would I be able to do the same query using user_id + timestamp + alert_id as 
>> row key?
>> 
>> Also, I know Cassandra supports up to 2 billion columns per row (2 billion 
>> rows per partition in CQL). Do you know what the limit is for HBase?
>> 
>> Best regards,
>> Marcelo Valle.
>> 
>> From: [email protected]
>> Subject: Re: data partitioning and data model
>> 
>> You can use a key like (user_id + timestamp + alert_id) to get
>> clustering of rows related to a user. To get better write throughput
>> and distribution over the cluster, you could pre-split the table and
>> use a consistent hash of the user_id as a row key prefix.
>> 
>> Have you looked at the rowkey design section in the HBase book:
>> http://hbase.apache.org/book.html#rowkey.design
>> 
>> Alok
>> 
>> On Fri, Feb 20, 2015 at 8:49 AM, Marcelo Valle (BLOOMBERG/ LONDON)
>> <[email protected]> wrote:
>>> Hello,
>>> 
>>> This is my first message on this mailing list; I just subscribed.
>>> 
>>> I have been using Cassandra for the last few years and now I am trying to 
>>> create a POC using HBase. Therefore, I am reading the HBase docs but it's 
>>> been really hard to find how HBase behaves in some situations, when 
>>> compared to Cassandra. I thought maybe it was a good idea to ask here, as 
>>> people in this list might know the differences better than anyone else.
>>> 
>>> What I want to do is create a simple application optimized for writes 
>>> (not interested in HBase / Cassandra product comparisons here, I am 
>>> assuming I will use HBase and that's it, just wanna understand the best way 
>>> of doing it in the HBase world). I want to be able to write alerts to the 
>>> cluster, where each alert would have columns like:
>>> - alert id
>>> - user id
>>> - date/time
>>> - alert data
>>> 
>>> Later, I want to search for alerts per user, so my main query could be 
>>> considered to be something like:
>>> Select * from alerts where user_id = $id and date/time > 10 days ago.
>>> 
>>> I want to decide the data model for my application.
>>> 
>>> Here are my questions:
>>> 
>>> - In Cassandra, I would partition by user + day, as some users can have 
>>> many alerts and some just 1 or a few. In HBase, assuming all alerts for a 
>>> user would always fit in a single partition / region, can I just use 
>>> user_id as my row key and assume data will be distributed across the cluster?
>>> 
>>> - Suppose I want to write 100 000 rows from a client machine and these are 
>>> from 30 000 users. What's the best way to write these if I want to 
>>> optimize for writes? Should I batch all 100k requests into one and send it 
>>> to a single server? As I am trying to optimize for writes, I would like to 
>>> split these requests across several nodes instead of sending them all to 
>>> one. I found this article: 
>>> http://hortonworks.com/blog/apache-hbase-region-splitting-and-merging/ but 
>>> I am not sure if it's what I need.
>>> 
>>> Thanks in advance!
>>> 
>>> Best regards,
>>> Marcelo.
>> 
>> 
>
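
For reference, the row-key idea quoted above (a hash of the user_id as a prefix, followed by user_id + timestamp + alert_id, on a pre-split table) can be sketched in plain Python without any HBase client. Everything named here is an assumption for illustration: the 16 buckets, the first MD5 byte taken modulo the bucket count as a simple stand-in for the "consistent hash", the 0x00 separator (which requires user ids that never contain a zero byte), and the big-endian 8-byte timestamp.

```python
import hashlib
import struct

NUM_BUCKETS = 16  # assumption: the table is pre-split into 16 regions


def salt_for(user_id: str, buckets: int = NUM_BUCKETS) -> int:
    """Stable bucket number derived from the user id alone."""
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    return digest[0] % buckets


def _user_prefix(user_id: str) -> bytes:
    # salt byte + user id + 0x00 separator (user ids must not contain 0x00)
    return bytes([salt_for(user_id)]) + user_id.encode("utf-8") + b"\x00"


def make_row_key(user_id: str, timestamp_ms: int, alert_id: str) -> bytes:
    # Big-endian timestamp so byte order matches chronological order.
    return _user_prefix(user_id) + struct.pack(">Q", timestamp_ms) + alert_id.encode("utf-8")


def split_points(buckets: int = NUM_BUCKETS) -> list:
    """Region boundaries to hand to table creation: one per salt value after 0."""
    return [bytes([i]) for i in range(1, buckets)]


def scan_range(user_id: str, since_ms: int) -> tuple:
    """Start (inclusive) and stop (exclusive) rows for one user's alerts since a time."""
    prefix = _user_prefix(user_id)
    start = prefix + struct.pack(">Q", since_ms)
    stop = prefix[:-1] + b"\x01"  # first key after every row of this user
    return start, stop


key = make_row_key("user42", 1424445600000, "alert-7")
start, stop = scan_range("user42", 1424000000000)
print(key, start <= key < stop)
```

Because the salt depends only on the user_id, all rows of a user stay contiguous inside one bucket, so the "user_id = X and timestamp > T" query remains a single range scan while different users spread over the pre-split regions.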
