I have a cluster that receives log files in CSV format on a per-minute
basis, and those files are immediately available to Drill users. For
performance I create Parquet files from them in batch using CTAS
commands.
I would like to script a process that makes the Parquet files available
on creation, perhaps through a UNION view, but that does not serve
duplicate data through both the original CSV and the converted Parquet
file at the same time.
Is there a common practice for making data available once converted,
something similar to a transactional batch of "convert, then (re)move
the source CSV files"?
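
One way the "convert, then (re)move" batch could be sketched is below. This is only an illustration under several assumptions: the data sits on a POSIX-style path where a rename within one filesystem is atomic, the `convert` callback is a hypothetical stand-in for whatever issues the CTAS command, and the function name, directory layout, and batch naming are all made up for the example. It is not truly transactional: the CSV files are moved out just before the Parquet output is renamed into place, so a query in that short window would miss the batch rather than see it twice.

```python
import os
from pathlib import Path
from typing import Callable, List


def publish_batch(
    csv_dir: Path,
    parquet_dir: Path,
    archive_dir: Path,
    convert: Callable[[List[Path], Path], None],
    batch_name: str,
) -> List[Path]:
    """Convert the current CSV files, then swap them for the Parquet output.

    `convert` is a hypothetical hook (for example, something that runs CTAS)
    that must create `staging` and write the converted files into it.
    """
    # Snapshot the batch so CSVs arriving mid-conversion wait for the next run.
    batch = sorted(csv_dir.glob("*.csv"))
    if not batch:
        return []

    # Write the output under a staging name outside the queried directories.
    staging = parquet_dir.parent / ("_staging_" + batch_name)
    convert(batch, staging)

    archive_dir.mkdir(parents=True, exist_ok=True)
    parquet_dir.mkdir(parents=True, exist_ok=True)

    # Move the source CSVs out of the queried directory first ...
    for csv_file in batch:
        os.rename(csv_file, archive_dir / csv_file.name)
    # ... then publish the whole Parquet batch with a single rename.
    os.rename(staging, parquet_dir / batch_name)
    return batch
```

A UNION ALL view over `csv_dir` and `parquet_dir` would then see each batch in only one of the two locations once the function returns.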
- "Transactional" conversion of CSV to Parquet? MattK