Whether you keep 24 column .Hbase will still store it as 24 records only. There are benefits on use cases. Like breaking into column families , how updates behave for a particular rowkey if you go with 24 columns.
From: Pariksheet Barapatre [mailto:[email protected]] Sent: Tuesday, June 23, 2015 7:07 PM To: [email protected] Subject: Re: Data Model Suggestion Thanks Skanda for your suggestion. If we go with 24 columns , won't it reduce number of rows over large data set and will improve scan performance. If I want to get daily report, I just have to scan only one row. Your thoughts? Thanks Pari On 23 June 2015 at 18:57, Skanda <[email protected]<mailto:[email protected]>> wrote: Hi Pari For your use-case, having it as part of the rowkey should be a better design than creating so many columns. Regards Skanda On Tue, Jun 23, 2015 at 6:49 PM, Pariksheet Barapatre <[email protected]<mailto:[email protected]>> wrote: Hello All, This is more like a HBase question but as I am planning to use Phoenix as a access layer, I hope phoenix user will help me. I would like to create time series data to get on-the-fly analytics. This use case is for adTech. Report - what is houly,daily,weekly impression counts at country level for a given advertisement ID (ADID). I am doing hourly aggregation and loading into a Phoenix table. Primary Key - ADID | COUNTRY | HOUR ID --------------------------------------------------------------------------------- ADID | COUNTRY | HOUR ID | CF.IMP | --------------------------------------------------------------------------------- 1 | US | 2015062301<tel:2015062301> | 3000 | 1 | US | 2015062302<tel:2015062302> | 3421 | 1 | UK | 2015062302<tel:2015062302> | 1212 | --------------------------------------------------------------------------------- Is it a good schema design or shall I create alternate schema as below Primary Key - ADID | COUNTRY | DAY ID ---------------------------------------------------------------------------------------------------- ADID | COUNTRY | DAY ID | CF.IMP01 | CF.IMP02 | ---------------------------------------------------------------------------------------------------- 1 | US | 20150623 | 3000 | 3421 | 1 | UK | 20150623 | NULL | 1212 | ---------------------------------------------------------------------------------------------------- Here, I have taken hour part from hour ID and created 24 columns. I gone through many time-series NoSQL blog posts, most the author suggest to go with wider rows as above. This will reduce the scan, but I don't see much difference in both Data Models in term of latency for scanning. Can anybody please suggest good approach for my use case? Cheers, Pari -- Cheers, Pari
