Hi, I am new to HBase. I have been building a POC application and have a design question.
Currently we have a single table with the following row key design: jobId_batchId_bundleId_uniquefileId. This is an offline processing system, so data is bulk loaded into HBase via map/reduce jobs. We only need to support report-generation queries that run map/reduce over a batch (possibly with a single column filter), using the batchId as the start/end scan key. Once we have finished processing a job, we are free to remove its data from HBase. Our workloads vary widely: a job could consist of 10 rows, 100,000 rows, or 1 billion rows, with the average around 10 million rows.

My question is about pre-splitting. If we have a billion rows that all share the same batchId (our map/reduce scan key), my understanding is that we should pre-split the table to create buckets hosted by different regions. Given how varied a job's workload can be, does it make sense to keep a single table containing all jobs? Or should we create one table per job and pre-split each table for its expected workload? With separate tables we could simply drop them when they are no longer needed.

If we do not use a separate table per job, how should we choose the split points? Should we split for our largest possible workload, even though 90% of our jobs fall near the lower bound in terms of row count? Would we run into any issues purging jobs of varying sizes if everything lived in a single table?

Any advice would be greatly appreciated. Thanks.
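To make the per-job option concrete, this is roughly what I was picturing by "pre-split the table for the given workload". The table name, column family, region count, and the two-digit salt prefix are all placeholders I made up for illustration, and I am assuming the HBase 2.x Java client:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CreatePreSplitJobTable {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Admin admin = connection.getAdmin()) {

                // Hypothetical per-job table; name and region count sized from the
                // job's expected row count at load time.
                TableName tableName = TableName.valueOf("job_1234");
                int numRegions = 20;

                // Split points assuming a two-digit salt prefix ("01".."19") in front of
                // the jobId_batchId_... key. With the raw keys, the split points would
                // instead have to land inside the batch's own key range, which we would
                // not know up front.
                byte[][] splitKeys = new byte[numRegions - 1][];
                for (int i = 1; i < numRegions; i++) {
                    splitKeys[i - 1] = Bytes.toBytes(String.format("%02d", i));
                }

                admin.createTable(
                    TableDescriptorBuilder.newBuilder(tableName)
                        .setColumnFamily(ColumnFamilyDescriptorBuilder.of("d"))
                        .build(),
                    splitKeys);

                // When the job is finished we would simply drop the whole table:
                // admin.disableTable(tableName); admin.deleteTable(tableName);
            }
        }
    }

The appeal of the per-job table is that purging is just a disable/delete of the table, but I do not know whether creating and dropping a table for every job like this is considered good practice.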
