I am planning out a central database for contact information, invoices, and a bunch of other domain-specific information that will be coming from hundreds of geographically disparate locations. Given the requirement that every change ever made be kept forever, I wanted to build this on Hadoop/HBase, but I am not sure of the best architecture for the problem.
The same person will exist in maybe a dozen or more of these locations, so both the per-location views of that person and a unified view must be visible. Additionally, every change (inclusive of updates, deletes, and inserts) is to be recorded permanently. This would be needed for every record for random-read use: you could pull up an address history for a person, and you could also see that person's history for every location we have data on him from. I would also want to be able to take all the changes from the beginning of the records up to some arbitrary year and 'play back' the updates, or something to that effect, to end up with a database exactly as it was at that time.

Usual usage would be the primary contact information and facts about the person (e.g. he attended this event, got this kind of training, etc.). Next to that you would have invoices, notes, his data as it is at a specific location, and historical records, both for the unified view and based on the local data.

The major pieces I had in mind include a Flume-style capture of all updates into files in HDFS for the playback, while cooking the same stream into HBase at the same time. How it's stored in HBase is where I get a little murky. I had been planning on a column family per major dataset, e.g. invoices, notes, local data sets, etc., but I know the docs suggest against more than two or three families, so I'm not sure about this one.

Does anyone have any tips, or know of places that do a similar thing?

Best,
James
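P.S. To make the playback idea concrete, here is a rough sketch of what I mean, assuming the change log is an append-only stream of timestamped put/delete events (the record shape, row keys, and column names here are made up for illustration, not a real schema; in practice the events would be the files Flume lands in HDFS):

```python
from datetime import datetime

# Hypothetical change-log records: (timestamp, row_key, column, op, value).
# These stand in for the events a Flume-style pipeline would write to HDFS.
EVENTS = [
    (datetime(2009, 1, 5), "person:123", "contact:address", "put", "12 Oak St"),
    (datetime(2010, 3, 2), "person:123", "contact:address", "put", "9 Elm Ave"),
    (datetime(2011, 7, 9), "person:123", "contact:phone", "put", "555-0101"),
    (datetime(2012, 2, 1), "person:123", "contact:phone", "delete", None),
]

def playback(events, as_of):
    """Replay all changes up to `as_of` to rebuild the state at that time."""
    state = {}
    for ts, row, col, op, value in sorted(events, key=lambda e: e[0]):
        if ts > as_of:
            break  # events are sorted, so nothing later applies
        if op == "put":
            state.setdefault(row, {})[col] = value
        elif op == "delete":
            state.get(row, {}).pop(col, None)
    return state

# State as of end of 2010: the later address is in effect, no phone yet.
print(playback(EVENTS, datetime(2010, 12, 31)))
# → {'person:123': {'contact:address': '9 Elm Ave'}}
```

The point is that if every mutation is captured with its timestamp, reconstructing the database at an arbitrary date is just a filtered replay; address history per person falls out of the same log by filtering on row key and column instead of date.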
