Hi Alok, 

Thanks for the answer. Yes, I have read this section, but it was a little too 
abstract for me; I think I needed to check my understanding. Your answer 
helped me confirm I am on the right path, thanks for that.
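
Just so you can sanity-check my understanding, this is roughly how I pictured 
the pre-split + hashed prefix you described. The table name "alerts", the 
column family "d", the bucket count and the client API version are all my own 
assumptions, and I haven't actually run this, it's only a sketch:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class CreateAlertsTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {

            // Pre-split into N buckets keyed on a one-byte hash prefix of
            // user_id, so writes spread across region servers from the start.
            int buckets = 16;                        // placeholder bucket count
            byte[][] splits = new byte[buckets - 1][];
            for (int i = 1; i < buckets; i++) {
                splits[i - 1] = new byte[] { (byte) i };
            }

            HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("alerts"));
            desc.addFamily(new HColumnDescriptor("d"));
            admin.createTable(desc, splits);
        }
    }

    // Row key: hash bucket + user_id + timestamp + alert_id
    static byte[] rowKey(String userId, long timestamp, String alertId, int buckets) {
        byte bucket = (byte) ((userId.hashCode() & 0x7fffffff) % buckets);
        return Bytes.add(new byte[] { bucket },
                Bytes.add(Bytes.toBytes(userId), Bytes.toBytes(timestamp),
                          Bytes.toBytes(alertId)));
    }
}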

One question: if, instead of using user_id + timestamp + alert_id, I use 
user_id + timestamp as the row key, would I still be able to store alert_id + 
alert_data in columns?

I took the idea from the last section of this link: 
http://www.appfirst.com/blog/best-practices-for-managing-hbase-in-a-high-write-environment/

But I wonder which option would be better for my case. It seems column scans 
are not as fast as row scans, but what would be the advantages of one design 
over the other?

If I use something like:
Row key: user_id + timestamp
Column prefix: alert_id 
Column value: json with alert data
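
To make the question concrete, I think a write in that layout would look more 
or less like this (only a fragment, reusing the connection and placeholder 
names from the sketch above):

String userId = "user42";                       // placeholder values
long timestamp = System.currentTimeMillis();
String alertId = "id1";
String alertJson = "{\"message\":\"placeholder\"}";

Table table = conn.getTable(TableName.valueOf("alerts"));

// Row key: user_id + timestamp (8-byte big-endian long, so a user's rows sort by time)
byte[] rowKey = Bytes.add(Bytes.toBytes(userId), Bytes.toBytes(timestamp));

Put put = new Put(rowKey);
// Column qualifier = alert_id, value = JSON blob with the alert data
put.addColumn(Bytes.toBytes("d"), Bytes.toBytes(alertId), Bytes.toBytes(alertJson));
table.put(put);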

Would I be able to do a query like the one below?
Select * from table where user_id = X and timestamp > T and (alert_id = id1 or 
alert_id = id2)
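
In HBase terms I imagine that query turning into a scan roughly like the one 
below (placeholder names again, and only a sketch: the start row is inclusive, 
so "timestamp > T" becomes a start at T + 1, and I'm assuming user ids are 
fixed-length or delimited so another user's rows can't fall into the key 
range). It reuses the table handle from the write sketch above.

long T = System.currentTimeMillis() - 10L * 24 * 60 * 60 * 1000;  // e.g. 10 days ago
byte[] user = Bytes.toBytes("user42");                            // user_id = X

Scan scan = new Scan();
scan.setStartRow(Bytes.add(user, Bytes.toBytes(T + 1)));           // timestamp > T
scan.setStopRow(Bytes.add(user, Bytes.toBytes(Long.MAX_VALUE)));   // stay inside this user's rows
scan.addColumn(Bytes.toBytes("d"), Bytes.toBytes("id1"));          // alert_id = id1
scan.addColumn(Bytes.toBytes("d"), Bytes.toBytes("id2"));          // or alert_id = id2

try (ResultScanner results = table.getScanner(scan)) {
    for (Result r : results) {
        // each Result is one user_id + timestamp row, with only the id1 / id2
        // columns returned
    }
}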

Would I be able to do the same query using user_id + timestamp + alert_id as 
row key?
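
With user_id + timestamp + alert_id as the row key, my understanding is that 
the time range still maps to a key range, but alert_id sits at the tail of the 
key, so it looks like it has to be checked per row, e.g. client-side as in the 
sketch below (same placeholders as above, and leaving out the hash-bucket 
prefix to keep it short; maybe a server-side filter could do the same check):

byte[] user = Bytes.toBytes("user42");
Scan scan = new Scan();
scan.setStartRow(Bytes.add(user, Bytes.toBytes(T + 1)));
scan.setStopRow(Bytes.add(user, Bytes.toBytes(Long.MAX_VALUE)));

try (ResultScanner results = table.getScanner(scan)) {
    for (Result r : results) {
        byte[] key = r.getRow();
        // alert_id is whatever follows user_id + 8-byte timestamp in the key
        String alertId = Bytes.toString(key, user.length + Bytes.SIZEOF_LONG,
                key.length - user.length - Bytes.SIZEOF_LONG);
        if (alertId.equals("id1") || alertId.equals("id2")) {
            // matching alert; the alert data would live in this row's columns
        }
    }
}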

Also, I know Cassandra supports up to 2 billion columns per row (2 billion 
rows per partition in CQL); do you know what the limit is for HBase?

Best regards,
Marcelo Valle.

From: [email protected] 
Subject: Re: data partitioning and data model

You can use a key like (user_id + timestamp + alert_id) to get
clustering of rows related to a user. To get better write throughput
and distribution over the cluster, you could pre-split the table and
use a consistent hash of the user_id as a row key prefix.

Have you looked at the rowkey design section in the HBase book:
http://hbase.apache.org/book.html#rowkey.design

Alok

On Fri, Feb 20, 2015 at 8:49 AM, Marcelo Valle (BLOOMBERG/ LONDON)
<[email protected]> wrote:
> Hello,
>
> This is my first message in this mailing list, I just subscribed.
>
> I have been using Cassandra for the last few years and now I am trying to 
> create a POC using HBase. I am reading the HBase docs, but it's been really 
> hard to find out how HBase behaves in some situations compared to Cassandra. 
> I thought it might be a good idea to ask here, as people on this list 
> probably know the differences better than anyone else.
>
> What I want to do is create a simple application optimized for writes (I'm 
> not interested in HBase / Cassandra product comparisons here, I am assuming 
> I will use HBase and that's it, I just want to understand the best way of 
> doing it in the HBase world). I want to be able to write alerts to the 
> cluster, where each 
> alert would have columns like:
> - alert id
> - user id
> - date/time
> - alert data
>
> Later, I want to search for alerts per user, so my main query could be 
> considered to be something like:
> Select * from alerts where user_id = $id and date/time > 10 days ago.
>
> I want to decide the data model for my application.
>
> Here are my questions:
>
> - In Cassandra, I would partition by user + day, as some users can have many 
> alerts and some just one or a few. In HBase, assuming all alerts for a user 
> would always fit in a single partition / region, can I just use user_id as my 
> row key and assume data will be distributed across the cluster?
>
> - Suppose I want to write 100,000 rows from a client machine and these are 
> from 30,000 users. What's the best way to write these if I want to optimize 
> for writes? Should I batch all 100k writes into one request to a single 
> server? As I am trying to optimize for writes, I would like to split these 
> requests across several nodes instead of sending them all to one. I found 
> this article: 
> http://hortonworks.com/blog/apache-hbase-region-splitting-and-merging/ but 
> I'm not sure if it's what I need.
>
> Thanks in advance!
>
> Best regards,
> Marcelo.

