Hi,

What is the best way to ingest large amounts of CSV data arriving at regular intervals (about every 15 minutes, for a total of roughly 500 GB or 1.5 billion records per day) that require a few transformations before being inserted?
By transformation I mean the following:
1) one field is converted to a timestamp
2) one field is parsed to create a new field
3) several fields are combined into one
4) a couple of columns need to be reordered

Is there any way to make these transformations through the bulk load tool, or is MR the best route? If I use MR, should I go purely through JDBC? Write directly to HBase? Do something similar to the CSV bulk load tool (perhaps even just customize the CsvBulkLoadTool)? Or something else altogether?

Thanks!
Ralph

__________________________________________________
Ralph Perko
Pacific Northwest National Laboratory
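For concreteness, the per-record logic behind the four transformations might look like the sketch below. The field layout, delimiter, and epoch-seconds input are all assumptions for illustration; the same logic would live inside whatever tool does the load (an MR mapper, a custom CsvBulkLoadTool subclass, etc.), not necessarily in Python.

```python
from datetime import datetime, timezone

def transform(line):
    """Apply the four transformations to one CSV record.

    Hypothetical input layout (assumed, not from the original post):
      f0: epoch seconds  -> converted to an ISO timestamp  (step 1)
      f1: "key=value"    -> parsed into a new field        (step 2)
      f2, f3             -> combined into one field        (step 3)
      f4, f5             -> emitted in swapped order       (step 4)
    """
    f0, f1, f2, f3, f4, f5 = line.rstrip("\n").split(",")
    ts = datetime.fromtimestamp(int(f0), tz=timezone.utc).isoformat()  # 1) timestamp
    key, value = f1.split("=", 1)                                      # 2) parse new field
    combined = f2 + "_" + f3                                           # 3) combine fields
    return ",".join([ts, key, value, combined, f5, f4])                # 4) reorder columns

print(transform("1400000000,src=10.0.0.1,alpha,beta,colB,colA"))
# -> 2014-05-13T16:53:20+00:00,src,10.0.0.1,alpha_beta,colA,colB
```

At ~1.5B records/day the transform itself is cheap; the design question is really where it runs, which is what the choices above (JDBC upserts vs. direct HBase writes vs. customizing the bulk load path) trade off.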
