Hi,

What is the best way to ingest a large volume of CSV data arriving at
regular intervals (a batch roughly every 15 minutes, totaling about 500 GB or
1.5 billion records per day) that requires a few transformations before being
inserted?

By transformation I mean the following (a rough sketch of what I have in mind
follows the list):
1) one field is converted to a timestamp
2) one field is parsed to create a new field
3) several fields are combined into one
4) a couple of columns need to be reordered
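
For concreteness, here is a minimal Java sketch of the per-record
transformation. The field positions, the timestamp format, and the
combine/reorder rules are all placeholders, just to show the shape of it:

    import java.sql.Timestamp;
    import java.text.ParseException;
    import java.text.SimpleDateFormat;

    public class RecordTransformer {

        // Assumed raw timestamp format -- placeholder for illustration.
        private static final SimpleDateFormat RAW_FORMAT =
                new SimpleDateFormat("yyyyMMdd HHmmss");

        public static String[] transform(String line) throws ParseException {
            String[] f = line.split(",", -1);

            // 1) convert one field to a SQL timestamp
            Timestamp ts = new Timestamp(RAW_FORMAT.parse(f[0]).getTime());

            // 2) parse one field to derive a new field (e.g. the value of "src=...")
            String src = f[1].substring(f[1].indexOf('=') + 1);

            // 3) combine several fields into one
            String combined = f[2] + "|" + f[3];

            // 4) reorder columns to match the target table
            return new String[] { ts.toString(), src, combined };
        }

        public static void main(String[] args) throws ParseException {
            System.out.println(String.join(",",
                    transform("20150601 120000,src=10.0.0.1,fieldA,fieldB")));
        }
    }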

Is there any way to make these transformations through the bulk load tool, or
is MapReduce (MR) the best route?
If I use MR, should I go purely through JDBC? Write directly to HBase? Do
something similar to the CSV bulk load tool (perhaps even just customizing the
CsvBulkLoadTool?), or something altogether different? A sketch of the JDBC
option I am picturing is below.
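
By "purely through JDBC" I mean roughly the following -- a minimal sketch,
where the connection string, table name, and column names are placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.Timestamp;

    public class PhoenixUpsertSketch {
        public static void main(String[] args) throws Exception {
            // ZooKeeper quorum, table, and columns are placeholders.
            try (Connection conn =
                    DriverManager.getConnection("jdbc:phoenix:zk-host")) {
                conn.setAutoCommit(false);
                PreparedStatement ps = conn.prepareStatement(
                        "UPSERT INTO MY_TABLE (TS, SRC, COMBINED) VALUES (?, ?, ?)");
                ps.setTimestamp(1, new Timestamp(System.currentTimeMillis()));
                ps.setString(2, "10.0.0.1");
                ps.setString(3, "fieldA|fieldB");
                ps.executeUpdate();
                conn.commit(); // Phoenix buffers upserts client-side until commit
            }
        }
    }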

Thanks!
Ralph

__________________________________________________
Ralph Perko
Pacific Northwest National Laboratory
