Hi all, I'm investigating performance and scalability improvements for one of our solutions. I'm currently trying to understand whether HBase (+ MapReduce) could provide the scalability we need.
This is the current situation:
- assume a daily inflow of 10 GB of data (20+ million rows)
- a daily job running on top of the daily data
- a monthly job running on top of the monthly data
- random access to small amounts of data, going back in time for longer periods (assume a year)

Now the HBase questions:

1) How would one approach splitting the data across nodes? Considering the daily MapReduce job that has to run, would it be best to separate the data on a daily basis? Is that possible with a single table, or would it make more sense to have one table per day (or similar)? From some investigation it seems one could implement a custom getSplits() so that the job maps only the part of the table containing that day's data; a rough sketch of what I have in mind is below. The monthly job would then reuse the same data as the daily one, but it has to go through all days in the month.

2) The random access case: is this feasible with HBase at all? There could be a few million random read requests going back a year in time (see the second sketch below). Note that a certain amount of latency is not a big issue, since the reads are done for independent operations.

There are plans to support larger amounts of data. My thinking is that the first three points could scale very well horizontally; what about the random reads?
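To make question 1 concrete, here is a minimal sketch of what I have in mind (Java, HBase client API). Instead of a fully custom getSplits(), it looks like handing TableInputFormat a Scan restricted to one day's key range might have the same effect, since only regions overlapping that range produce input splits. The table name "events", the trivial mapper, and the "yyyyMMdd"-prefixed row keys are just assumptions for illustration:

// Sketch only: restrict the daily MapReduce job to one day's rows by
// handing TableInputFormat a bounded Scan instead of a custom getSplits().
// Assumes row keys of the form "yyyyMMdd" + <record id> and a table "events".
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class DailyJob {

  // Trivial mapper that emits one count per row; the real daily
  // aggregation logic would go here.
  static class DailyMapper extends TableMapper<ImmutableBytesWritable, LongWritable> {
    @Override
    protected void map(ImmutableBytesWritable rowKey, Result row, Context context)
        throws IOException, InterruptedException {
      context.write(rowKey, new LongWritable(1L));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "daily-job-20240101");

    // Scan only the key range for one day; regions outside the range
    // produce no input splits, so only that day's data is mapped.
    Scan scan = new Scan();
    scan.setStartRow(Bytes.toBytes("20240101"));
    scan.setStopRow(Bytes.toBytes("20240102")); // exclusive upper bound
    scan.setCaching(500);
    scan.setCacheBlocks(false); // recommended for MapReduce scans

    TableMapReduceUtil.initTableMapperJob(
        "events", scan, DailyMapper.class,
        ImmutableBytesWritable.class, LongWritable.class, job);
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(NullOutputFormat.class); // discard output in this sketch
    job.waitForCompletion(true);
  }
}

The monthly job could then run against the same table with a wider [start, stop) key range covering the whole month, which is what I meant by reusing the daily data.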
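And for question 2, this is the kind of random read I mean; potentially a few million of these, each one independent of the others. Again, the table name, column family, and key layout are made-up placeholders:

// Sketch of a single random read going back in time, assuming the same
// "yyyyMMdd" + <record id> row-key scheme and a column family "d".
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RandomRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("events"))) {
      // A row from roughly a year ago; the key layout is an assumption.
      Get get = new Get(Bytes.toBytes("20230101-someRecordId"));
      get.addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"));
      Result result = table.get(get);
      byte[] payload = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("payload"));
      System.out.println(payload == null ? "not found" : Bytes.toString(payload));
    }
  }
}

My hope is that, with a year of data spread over many regions, these gets would also spread across the region servers, but that is exactly the part I am unsure about.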
Regards,
Igor