Very nice presentation. Awesome simulation tool! Couldn't help but leave a comment. Or two.

1. It is even possible to set the qualifier name to an empty byte[]. This
   might save you some extra byte(s) ;)

2. It looks like after several days you have a lot of data in the memstore
   that is not frequently accessed, i.e. in the memstores of the regions
   that hold data several days old or older. It would be great to use this
   valuable main memory for storing frequently accessed data instead.
   Quick thoughts:

   * perform a manual flush of the older regions' memstores periodically;
     this frees that memory, which can then be used:
     ** for bigger memstores (I believe that should especially improve your
        timings for fetching data older than an hour; there's kind of a
        spike on the fetch time chart there)
     ** for bigger block caches
     ** for having more "hot" regions per RS

Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr

P.S. Any chance of converting the first video of the simulation tool to a
GIF or something similar and allowing it to be used for teaching? ;)

P.S.-2 Have you tried connecting it to a real cluster already? I know we
are all busy, but I still hope you'll find the time. By the way, I believe
it will soon be easier to integrate, as HBase metrics are getting a lot of
attention. They should be much more usable soon.

On Thu, Jul 26, 2012 at 1:06 PM, Cristofer Weber <[email protected]> wrote:

> Hi there
>
> There are some really good ideas in this presentation from HBaseCon:
> http://www.cloudera.com/resource/video-hbasecon-2012-real-performance-gains-with-real-time-data/
>
> Regards,
> Cristofer
>
> -----Original Message-----
> From: Alex Baranau [mailto:[email protected]]
> Sent: Thursday, July 26, 2012 11:28
> To: [email protected]
> Subject: Re: Hbase Data Model to purge old data.
>
> > reason for this is bulk delete of one day's data within a big table is
> > more expensive than dropping a one day table
>
> Sorry for the obvious question, but have you tried using TTLs instead of
> deleting rows explicitly?
> This should put less load on the cluster, though you'll still have to run
> major_compaction, which might be a resource-intensive process.
>
> > In this per-day-separate-table model, the load balancer will never get
> > triggered as the current day's table is always in memory, and daughter
> > regions will continuously get assigned to the same region server. This
> > leads to region server hotspots.
>
> Again, maybe an obvious question: have you tried to (or is it possible in
> your case to) pre-split the table so that regions are distributed over the
> cluster from the start?
>
> Alex Baranau
> ------
> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
>
> On Thu, Jul 26, 2012 at 2:34 AM, Padmanaban <[email protected]> wrote:
>
> > We have the following use case:
> >
> > Store telecom CDR data on a per-subscriber basis. The data is time-series
> > based, every record is per subscriber, and it comes in round the clock;
> > the expected volume of data is around 300 million records/day.
> > This data is to be queried 24/7 by an online system where the filters
> > are subscriber id and date range.
> >
> > Since the volume of data is huge, we have data retention policies to
> > archive old data on a daily basis.
> > For example, if retention is set to 90 days, every day an offline
> > process would delete data from HBase that is older than 90 days and
> > archive it on tape.
> >
> > The current HBase data model design is as follows:
> > Separate table for every day's data with row key as subscriber id. The
> > reason for this is that a bulk delete of one day's data within a big
> > table is more expensive than dropping a one-day table. In this
> > per-day-separate-table model, the load balancer will never get
> > triggered as the current day's table is always in memory, and daughter
> > regions will continuously get assigned to the same region server. This
> > leads to region server hotspots.
> >
> > Please give feedback on whether the per-day-separate-table model is the
> > best practice for this use case, considering the data life cycle
> > management requirement.
> > If yes, how do we solve the side effect of region server hotspots? If
> > no, please advise an alternate model.
> >
> > Thanks in advance,
> > Padmanaban M
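
A minimal sketch of the two suggestions quoted above (TTL-based expiry and pre-splitting), in plain Java with no HBase dependency. It only computes the values one would hand to HBase: the 90-day retention expressed in seconds (column family TTLs are set in seconds), and evenly spaced split keys. The class and method names are hypothetical, and the assumption that subscriber ids are fixed-width, zero-padded decimal strings spread uniformly over the key space does not come from the thread; real id distributions may need different boundaries.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of HBase.
public class PreSplitSketch {

    // Retention period in seconds, the unit used for a column family TTL.
    public static int ttlSeconds(int retentionDays) {
        return retentionDays * 24 * 60 * 60;
    }

    // Evenly spaced split keys for row keys that are ASSUMED to be
    // zero-padded decimal subscriber ids of width idDigits.
    // numRegions regions need numRegions - 1 split keys.
    public static byte[][] splitKeys(int numRegions, int idDigits) {
        if (numRegions < 2 || idDigits < 1 || idDigits > 18) {
            throw new IllegalArgumentException("need numRegions >= 2 and 1 <= idDigits <= 18");
        }
        long keySpace = 1;
        for (int i = 0; i < idDigits; i++) {
            keySpace *= 10; // 10^idDigits possible ids
        }
        if (numRegions > keySpace) {
            throw new IllegalArgumentException("more regions than possible keys");
        }
        long step = keySpace / numRegions;
        byte[][] splits = new byte[numRegions - 1][];
        for (int i = 1; i < numRegions; i++) {
            // Zero-pad so that byte-wise ordering matches numeric ordering.
            String key = String.format("%0" + idDigits + "d", step * i);
            splits[i - 1] = key.getBytes(StandardCharsets.UTF_8);
        }
        return splits;
    }

    public static void main(String[] args) {
        System.out.println("TTL for 90 days: " + ttlSeconds(90) + " s");
        for (byte[] key : splitKeys(4, 8)) {
            System.out.println(new String(key, StandardCharsets.UTF_8));
        }
    }
}
```

The resulting byte[][] is the shape expected by the createTable variant of the HBase admin API that accepts split keys, so regions would exist on several servers before the first write arrives. Whether a single TTL-managed table beats per-day tables for this workload still depends on the major compaction cost mentioned above.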
