Hi,

I did look more into this and have a better idea of how it could be implemented.
As values are looked up by date (and sometimes additionally by source ID), it would make sense to store each value in a separate row. The rowkey would be some kind of timeseries key, like:

  timestamp_sourceID

However, the docs suggest this is a bad idea, as all inserts go to only one region at a time (because the rowkeys have the same/increasing beginning). I have taken a look at the OpenTSDB schema, where the metric ID (or source ID in this case) is stored first, followed by the timestamp (albeit at 10m granularity; they store the exact time details in columns). However, their scans know the metric ID for which the scan is done (at least this is what I saw from a quick look at the code - please correct me if I'm wrong), which we do not.

In our case, we want to use HBase's ability to do scans on partial keys to get all rows for a specific day (or year/month). Assuming the timestamp format is YYYY-MM-DDTHH:MM:SS (ignore the length of the rowkey for the purpose of this discussion), we could scan for:

  YYYY
  YYYY-MM
  YYYY-MM-DD
  etc.

How can the same scan effectiveness be achieved (i.e., not scanning the whole table and ignoring older/newer timestamps) if the timestamp is not at the beginning of the rowkey?

Regards,
Igor

On Tue, Feb 14, 2012 at 1:48 PM, Igor Lautar <[email protected]> wrote:
> Hi All,
>
> I'm doing an investigation into performance and scalability improvements
> for one of our solutions. I'm currently in a phase where I'm trying to
> understand if HBase (+MapReduce) could provide the scalability needed.
>
> This is the current situation:
> - assume a daily inflow of 10 GB of data (20+ million rows)
> - a daily job running on top of the daily data
> - a monthly job running on top of the monthly data
> - random access to small amounts of data going back in time for longer
> periods (assume a year)
>
> Now the HBase questions:
> 1) How would one approach splitting the data across nodes?
> Considering the daily MapReduce job it would have to run, would it be best
> to separate the data on a daily basis?
> Is this possible with a single table, or would it make sense to have 1
> table per day (or similar)?
> I did some investigation on this and it seems one could implement a custom
> getSplits() to map only the part of the table containing the daily data?
>
> The monthly job then just reuses the same data as the daily one, but it
> has to go through all days in the month.
>
> 2) Random access case
> Is this feasible with HBase at all? There could be something like a
> few million random read requests going back a year in time. Note that a
> certain amount of latency is not a big issue, as reads are done for
> independent operations.
>
> There are plans to support larger amounts of data. My thinking is that the
> first 3 points could scale very well horizontally, but what about random
> reads?
>
> Regards,
> Igor
>
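P.S. To make the bucketing trade-off concrete, here is a rough sketch in plain Python (not HBase client code - it just simulates lexicographic rowkey ordering; the bucket count, key layout, and helper names are my own assumptions). The idea is the common "salting" workaround: prefixing the rowkey with hash(sourceID) % N spreads sequential timestamps across N regions, at the cost that a scan for one day fans out into N prefix scans, one per bucket:

```python
import hashlib

NUM_BUCKETS = 8  # assumed number of salt buckets (one hotspot-free write path each)

def make_rowkey(timestamp, source_id):
    """Rowkey = bucket_timestamp_sourceID; the bucket prefix spreads
    writes with monotonically increasing timestamps across NUM_BUCKETS
    regions instead of hammering a single one."""
    bucket = int(hashlib.md5(source_id.encode()).hexdigest(), 16) % NUM_BUCKETS
    return "%02d_%s_%s" % (bucket, timestamp, source_id)

def prefix_scan(sorted_keys, prefix):
    """Simulate an HBase partial-key scan: return all rowkeys sharing a prefix."""
    return [k for k in sorted_keys if k.startswith(prefix)]

def scan_period(all_keys, period):
    """A scan for one day/month/year ('2012-02-14', '2012-02', '2012')
    becomes NUM_BUCKETS prefix scans, one per bucket, merged together."""
    sorted_keys = sorted(all_keys)
    results = []
    for bucket in range(NUM_BUCKETS):
        results.extend(prefix_scan(sorted_keys, "%02d_%s" % (bucket, period)))
    return results
```

Each of the N scans still skips older/newer timestamps (the timestamp is second in the key, so within a bucket the prefix bounds the range), so the cost is N small scans rather than one full-table scan. Whether N scans per query is acceptable would depend on how many buckets are needed to spread the write load.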
