Re: data partitioning and data model

2015-02-23 Thread Marcelo Valle (BLOOMBERG/ LONDON)
Does this also happen when using pre-loading? In the case of a rebalance, if I try to WRITE data to a record being rebalanced, would the write performance be affected? Best regards, Marcelo Valle. From: user@hbase.apache.org Subject: Re: data partitioning and data model You don't want a lot

Re: data partitioning and data model

2015-02-23 Thread Marcelo Valle (BLOOMBERG/ LONDON)
I am sorry, consider I am using auto pre-splitting for the question below. From: user@hbase.apache.org Subject: Re: data partitioning and data model Thanks Alok, I will take a good look at the link for sure. Just an additional question, I saw, reading this: http://stackoverflow.com/questions

Re: data partitioning and data model

2015-02-23 Thread Alok Singh
partitioning and data model You don't want a lot of columns in a write heavy table. HBase stores the row key along with each cell/column (Though old, I find this still useful: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html) Having a lot of columns will amplify the amount of data
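Alok's point above is that HBase stores the row key (plus family, qualifier and timestamp) alongside every single cell, so wide, write-heavy rows amplify the bytes written. A back-of-envelope sketch, using a simplified version of the classic KeyValue key layout; the field sizes are illustrative assumptions, and block encoding or compression changes the real on-disk numbers:

```java
// Rough estimate of per-cell key overhead in HBase's KeyValue format
// (simplified: real storage adds length fields and is affected by
// block encoding and compression -- these figures are assumptions).
public class CellOverhead {
    // 2B row length + row key + 1B family length + family + qualifier
    // + 8B timestamp + 1B key type
    static long perCellKeyBytes(int rowKeyLen, int familyLen, int qualifierLen) {
        return 2 + rowKeyLen + 1 + familyLen + qualifierLen + 8 + 1;
    }

    public static void main(String[] args) {
        int rowKeyLen = 32, valueLen = 8, columns = 1000;
        // One row with 1,000 columns: the 32-byte row key is repeated per cell.
        long logical = (long) columns * valueLen;
        long physical = columns * (perCellKeyBytes(rowKeyLen, 1, 10) + valueLen);
        System.out.printf("logical=%dB physical~%dB amplification~%.1fx%n",
                logical, physical, (double) physical / logical);
    }
}
```

With these assumed sizes, 8 KB of values becomes roughly 63 KB of cells, which is why the thread steers away from very wide rows in write-heavy tables.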

Re: data partitioning and data model

2015-02-23 Thread Marcelo Valle (BLOOMBERG/ LONDON)
Thanks a lot! From: aloksi...@gmail.com Subject: Re: data partitioning and data model I meant, in the normal course of operation, rebalancing will not affect writes in flight. This is never an issue when pre-splitting because, by definition, splits occurred before data was written
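Pre-splitting as described here means choosing the region boundaries before any data arrives. Assuming the one-byte hash prefix suggested earlier in the thread, evenly spaced single-byte split points are enough; a minimal sketch (the region count and prefix width are my assumptions):

```java
// Generate evenly spaced single-byte split points for a table whose
// row keys start with a one-byte hash prefix (an assumption carried
// over from the keying scheme discussed in this thread).
public class SplitPoints {
    static byte[][] splits(int numRegions) {
        byte[][] points = new byte[numRegions - 1][];
        for (int i = 1; i < numRegions; i++) {
            // N regions need N-1 boundaries spread across the 0x00-0xFF range.
            points[i - 1] = new byte[] { (byte) (i * 256 / numRegions) };
        }
        return points;
    }

    public static void main(String[] args) {
        for (byte[] p : splits(16)) {
            System.out.printf("%02x ", p[0]);
        }
        System.out.println();
    }
}
```

For 16 regions this yields boundaries 0x10, 0x20, … 0xf0, which could then be handed to the HBase admin API's createTable variant that accepts split keys.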

Re: data partitioning and data model

2015-02-23 Thread Marcelo Valle (BLOOMBERG/ LONDON)
distributed on every partition, I might end up seeing increased read/write latency while data is moving from one region to another, although this could be rare, is this right? From: user@hbase.apache.org Subject: Re: data partitioning and data model Assuming the cluster is not manually

Re: data partitioning and data model

2015-02-23 Thread Alok Singh
in this thread to keep data almost evenly distributed on every partition, I might end up seeing increased read/write latency while data is moving from one region to another, although this could be rare, is this right? From: user@hbase.apache.org Subject: Re: data partitioning and data model

Re: data partitioning and data model

2015-02-23 Thread Michael Segel
Hi, Yes you would want to start your key by user_id. But you don’t need the timestamp. The user_id + alert_id should be enough on the key. If you want to get fancy… If your alert_id is not a number, you could use the EPOCH - Timestamp as a way to invert the order of the alerts so that the
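The trick Michael describes, subtracting the timestamp from a fixed maximum, inverts sort order so the newest alerts come first under HBase's lexicographic byte comparison. A sketch using `Long.MAX_VALUE` as the anchor value (my assumption; any fixed upper bound works):

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Inverted-timestamp key component: (MAX - epochMillis) encoded big-endian
// makes newer events sort lexicographically *before* older ones.
public class InvertedTimestamp {
    static byte[] invertedTs(long epochMillis) {
        return ByteBuffer.allocate(8).putLong(Long.MAX_VALUE - epochMillis).array();
    }

    public static void main(String[] args) {
        byte[] older = invertedTs(1_000L);
        byte[] newer = invertedTs(2_000L);
        // Unsigned byte-wise comparison, as HBase compares row keys:
        int cmp = Arrays.compareUnsigned(newer, older);
        System.out.println(cmp < 0 ? "newer sorts first" : "older sorts first");
    }
}
```

With this component appended after the user_id, a plain forward scan over a user's rows returns their most recent alerts first, with no need to reverse the scan.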

Re: data partitioning and data model

2015-02-23 Thread Michael Segel
Cassandra supports up to 2 billion columns per row (2 billion rows per partition in CQL), do you know what's the limit for HBase? Best regards, Marcelo Valle. From: aloksi...@gmail.com Subject: Re: data partitioning and data model You can use a key like (user_id + timestamp + alert_id

data partitioning and data model

2015-02-20 Thread Marcelo Valle (BLOOMBERG/ LONDON)
Hello, This is my first message in this mailing list, I just subscribed. I have been using Cassandra for the last few years and now I am trying to create a POC using HBase. Therefore, I am reading the HBase docs but it's been really hard to find how HBase behaves in some situations, when

Re: data partitioning and data model

2015-02-20 Thread Alok Singh
You can use a key like (user_id + timestamp + alert_id) to get clustering of rows related to a user. To get better write throughput and distribution over the cluster, you could pre-split the table and use a consistent hash of the user_id as a row key prefix. Have you looked at the rowkey design
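The composite key Alok suggests can be sketched as follows. The one-byte MD5 prefix, UTF-8 encoding, and field order are my assumptions for illustration; HBase itself only sees opaque bytes:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Sketch of a (hash-prefix + user_id + timestamp + alert_id) row key.
// The single-byte MD5 prefix is an assumed choice: it spreads users
// across pre-split regions while keeping each user's rows contiguous.
public class AlertRowKey {
    static byte[] rowKey(String userId, long epochMillis, String alertId) throws Exception {
        byte[] user = userId.getBytes(StandardCharsets.UTF_8);
        byte[] alert = alertId.getBytes(StandardCharsets.UTF_8);
        byte hashPrefix = MessageDigest.getInstance("MD5").digest(user)[0];
        ByteBuffer buf = ByteBuffer.allocate(1 + user.length + 8 + alert.length);
        buf.put(hashPrefix);      // consistent-hash prefix for distribution
        buf.put(user);            // clusters all rows for one user together
        buf.putLong(epochMillis); // orders a user's alerts by time
        buf.put(alert);           // disambiguates same-millisecond alerts
        return buf.array();
    }

    public static void main(String[] args) throws Exception {
        byte[] key = rowKey("user42", 1424649600000L, "a1");
        System.out.println("key length: " + key.length);
    }
}
```

A scan for one user's alerts then only needs the prefix plus user_id as the start row; in production one would also fix or delimit the user_id width so variable-length ids cannot blur the field boundaries.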

Re: data partitioning and data model

2015-02-20 Thread Marcelo Valle (BLOOMBERG/ LONDON)
what's the limit for HBase? Best regards, Marcelo Valle. From: aloksi...@gmail.com Subject: Re: data partitioning and data model You can use a key like (user_id + timestamp + alert_id) to get clustering of rows related to a user. To get better write throughput and distribution over the cluster

Re: data partitioning and data model

2015-02-20 Thread Alok Singh
+ timestamp + alert_id as row key? Also, I know Cassandra supports up to 2 billion columns per row (2 billion rows per partition in CQL), do you know what's the limit for HBase? Best regards, Marcelo Valle. From: aloksi...@gmail.com Subject: Re: data partitioning and data model You can