Hi all, I'm investigating performance and scalability improvements for one of our solutions. I'm currently trying to understand whether HBase (+ MapReduce) could provide the scalability we need.
This is the current situation:
- assume a daily inflow of 10 GB of data (20+ million rows)
- a daily job running on top of the daily data
- a monthly job running on top of the monthly data
- random access to small amounts of data, going back in time for longer periods (assume a year)

Now the HBase questions:

1) How would one approach splitting the data across nodes? Considering the daily MapReduce job that has to run, would it be best to separate the data on a daily basis? Is that possible with a single table, or would it make more sense to have one table per day (or similar)? From some investigation it seems one could implement a custom getSplits() so that the job maps only the part of the table containing that day's data; a rough sketch of what I have in mind is below. The monthly job would then reuse the same data as the daily one, but it has to go through all days in the month.

2) The random access case: is this feasible with HBase at all? There could be a few million random read requests going back a year in time (see the second sketch below). Note that a certain amount of latency is not a big issue, since the reads are done for independent operations.

There are plans to support larger amounts of data. My thinking is that the first three points could scale very well horizontally; what about the random reads?
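To make question 1 concrete, here is a minimal sketch of what I have in mind (Java, HBase client API). Instead of a fully custom getSplits(), it looks like handing TableInputFormat a Scan restricted to one day's key range might have the same effect, since only regions overlapping that range produce input splits. The table name "events", the trivial mapper, and the "yyyyMMdd"-prefixed row keys are just assumptions for illustration:

// Sketch only: restrict the daily MapReduce job to one day's rows by
// handing TableInputFormat a bounded Scan instead of a custom getSplits().
// Assumes row keys of the form "yyyyMMdd" + <record id> and a table "events".
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class DailyJob {

  // Trivial mapper that emits one count per row; the real daily
  // aggregation logic would go here.
  static class DailyMapper extends TableMapper<ImmutableBytesWritable, LongWritable> {
    @Override
    protected void map(ImmutableBytesWritable rowKey, Result row, Context context)
        throws IOException, InterruptedException {
      context.write(rowKey, new LongWritable(1L));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "daily-job-20240101");

    // Scan only the key range for one day; regions outside the range
    // produce no input splits, so only that day's data is mapped.
    Scan scan = new Scan();
    scan.setStartRow(Bytes.toBytes("20240101"));
    scan.setStopRow(Bytes.toBytes("20240102")); // exclusive upper bound
    scan.setCaching(500);
    scan.setCacheBlocks(false); // recommended for MapReduce scans

    TableMapReduceUtil.initTableMapperJob(
        "events", scan, DailyMapper.class,
        ImmutableBytesWritable.class, LongWritable.class, job);
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(NullOutputFormat.class); // discard output in this sketch
    job.waitForCompletion(true);
  }
}

The monthly job could then run against the same table with a wider [start, stop) key range covering the whole month, which is what I meant by reusing the daily data.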
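And for question 2, this is the kind of random read I mean; potentially a few million of these, each one independent of the others. Again, the table name, column family, and key layout are made-up placeholders:

// Sketch of a single random read going back in time, assuming the same
// "yyyyMMdd" + <record id> row-key scheme and a column family "d".
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RandomRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("events"))) {
      // A row from roughly a year ago; the key layout is an assumption.
      Get get = new Get(Bytes.toBytes("20230101-someRecordId"));
      get.addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"));
      Result result = table.get(get);
      byte[] payload = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("payload"));
      System.out.println(payload == null ? "not found" : Bytes.toString(payload));
    }
  }
}

My hope is that, with a year of data spread over many regions, these gets would also spread across the region servers, but that is exactly the part I am unsure about.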
Regards,
Igor