Hi all, at my company we are currently working hard on deploying our cluster with HBase.
We are talking about ~20 nodes holding a fairly large amount of data (~1 TB per day). Since there is so much data, we need a retention mechanism, i.e. a way to remove old data. The problem is that I can't (and don't want to) use TTL, for two reasons:
1. Different customers have different retention policies.
2. A policy might change over time.

Of course, I could run a nightly (weekly?) MR job over all the data that removes the old rows, but that has a few problems:
1. It scans a huge amount of data only to remove a small portion of it.
2. It is a heavy MR job.
3. It requires a major compaction afterwards, which will hurt performance or perhaps even stop the service (is that right???).

I could use BulkFileOutputFormat for that job, but the problems above remain.

Since my data is sorted by the retention keys (customer and time), I thought of this option:
1. Split regions so that one region holds only the 'candidates for removal'.
2. Drop that region.

- Is it possible to drop a region?
- Do you think this is a good idea?
- Any other ideas?

Thanks,
Ophir Cohen
LivePerson
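For what it's worth, here is a minimal sketch of the per-customer retention test that such a job's mapper could apply to each row key. It assumes (purely for illustration) that row keys look like "<customer>|<epochMillis>", matching the customer-and-time sort order described above, and that the per-customer policies live in an in-memory map; the customer names and retention periods are made up.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: decide whether a row is past its customer's retention
// window. A nightly MR job's mapper could emit a Delete for rows where
// isExpired(...) returns true.
public class RetentionCheck {
    // Assumed: retention period per customer, in days (e.g. loaded from config).
    private final Map<String, Integer> retentionDays = new HashMap<>();

    public RetentionCheck() {
        retentionDays.put("customerA", 30);
        retentionDays.put("customerB", 90);
    }

    // Assumed row-key layout: "<customer>|<epochMillis>".
    public boolean isExpired(String rowKey, long nowMillis) {
        String[] parts = rowKey.split("\\|");
        String customer = parts[0];
        long ts = Long.parseLong(parts[1]);
        Integer days = retentionDays.get(customer);
        if (days == null) {
            return false; // no policy for this customer -> keep the row
        }
        long cutoff = nowMillis - TimeUnit.DAYS.toMillis(days);
        return ts < cutoff; // older than the cutoff -> candidate for removal
    }

    public static void main(String[] args) {
        RetentionCheck rc = new RetentionCheck();
        long now = 1_700_000_000_000L;
        long fortyDaysAgo = now - TimeUnit.DAYS.toMillis(40);
        // 40 days old: past customerA's 30-day policy, within customerB's 90-day policy.
        System.out.println(rc.isExpired("customerA|" + fortyDaysAgo, now)); // true
        System.out.println(rc.isExpired("customerB|" + fortyDaysAgo, now)); // false
    }
}
```

The same predicate could just as well drive the region-split approach: the cutoff timestamps define the split keys that separate the 'candidates for removal' from live data.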
