HBase Key Design : Doubt

Narayanan K Wed, 10 Oct 2012 12:13:57 -0700

Hi all,

I have a use case where I need to count the unique values that appear in
HBase across dates.


Say on 1st Oct the value A-B-C-D appeared, so I insert a row with rowkey
A-B-C-D.
On 2nd Oct I get the same value A-B-C-D, and I don't want to store the row
redundantly under a new rowkey for 2nd Oct,
i.e. I do not want both 20121001-A-B-C-D and 20121002-A-B-C-D as 2
rowkeys in the table.
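
To make the two rowkey layouts concrete, here is a minimal sketch in plain
Python. It only models a table as a dict keyed by rowkey and does not use the
HBase client API; the sample observations and function names are made up for
illustration.

```python
# Plain-Python model of a table: a dict keyed by rowkey (not the HBase API).
# Hypothetical sample data: (date, value) pairs as they arrive.
observations = [
    ("20121001", "A-B-C-D"),
    ("20121002", "A-B-C-D"),
    ("20121002", "E-F-G-H"),
]

def date_prefixed_keys(obs):
    """Layout to avoid: one row per (date, value) pair."""
    return {f"{date}-{value}": {} for date, value in obs}

def value_keys(obs):
    """Desired layout: one row per value, with the dates kept inside the row."""
    table = {}
    for date, value in obs:
        table.setdefault(value, set()).add(date)
    return table

prefixed = date_prefixed_keys(observations)
by_value = value_keys(observations)
print(len(prefixed))  # one row per (date, value) pair
print(len(by_value))  # one row per distinct value
```

With the value as the rowkey, a repeated value lands on the same row, so the
row count is already the distinct count.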

E.g. if I have 1st Oct and 2nd Oct as 2 column families and the number of
versions is set to 1, only 1 row with rowkey A-B-C-D will be present for
both dates.
Hence, to count A-B-C-D only once for the period Oct 1 to Oct 2, I just
need to take the row count while filtering on those 2 column families.
Similarly, if we have 10 date column families and I need to scan only 2
dates, the scan reads only the store files of the specified column
families. This will make scanning faster.
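
The counting idea above can be sketched as follows. This is a plain-Python
model under the assumption that each date is a column family name; the table
contents and the helper `unique_count` are hypothetical and stand in for a
filtered row count, not for any real HBase call.

```python
# row -> {column family (date) -> cell value}; hypothetical sample data.
table = {
    "A-B-C-D": {"20121001": 1, "20121002": 1},
    "E-F-G-H": {"20121002": 1},
    "I-J-K-L": {"20121003": 1},
}

def unique_count(table, families):
    """Count rows that have a cell in at least one of the given families."""
    wanted = set(families)
    return sum(1 for cells in table.values() if not wanted.isdisjoint(cells))

oct_1_to_2 = unique_count(table, ["20121001", "20121002"])
oct_3_only = unique_count(table, ["20121003"])
print(oct_1_to_2, oct_3_only)
```

A-B-C-D appears on both dates but is still counted once, because the count
is over rows and not over cells.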

But the design problem here is that I can't add a new column family to the
table each day.

I would need to store data every day, and I read that HBase doesn't work well
with more than 3 column families.

The other option is to have a single column family and store the dates as
qualifiers: date:d1, date:d2, ... But if there are 30 date qualifiers
under the date column family, then scanning a single date qualifier, or
maybe a range of 2-3 dates, will have to read through the data of all
qualifiers d1 to d30 in the date column family, which would be slower than
having a separate column family for each date.
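
For comparison, the single-family layout can be modelled like this. It is a
plain-Python sketch that assumes the qualifiers are yyyymmdd strings (instead
of d1, d2, ...) so that lexicographic order matches date order; the data and
helper names are hypothetical. It shows only the logical layout and the
query, and says nothing about how much data HBase actually reads from disk.

```python
from bisect import bisect_left, bisect_right

# row -> sorted list of date qualifiers present under the single "date" family.
# Hypothetical sample data; qualifiers are assumed to be yyyymmdd strings.
table = {
    "A-B-C-D": ["20121001", "20121002", "20121015"],
    "E-F-G-H": ["20121002"],
    "I-J-K-L": ["20121020", "20121030"],
}

def has_date_in_range(qualifiers, start, stop):
    """True if some qualifier q satisfies start <= q <= stop (sorted input)."""
    lo = bisect_left(qualifiers, start)
    hi = bisect_right(qualifiers, stop)
    return hi > lo

def unique_count_in_range(table, start, stop):
    """Count rows with at least one date qualifier inside [start, stop]."""
    return sum(
        1 for quals in table.values() if has_date_in_range(quals, start, stop)
    )

early = unique_count_in_range(table, "20121001", "20121003")
late = unique_count_in_range(table, "20121016", "20121031")
print(early, late)
```

The query is the same as in the column-family design; the open question is
only how the two layouts compare in I/O.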

Please share your thoughts on this, along with any alternative design
suggestions you might have.

Regards,
Narayanan
