I have a cluster that receives log files in CSV format every minute, and those files are immediately available to Drill users. For performance, I convert them to Parquet files in batches using CTAS commands.

I would like to script a process that makes the Parquet files available as soon as they are created, perhaps through a UNION view, without serving duplicate data from both the original CSV file and its converted Parquet file at the same time.

Is there a common practice for making data available once it has been converted, something like a transactional batch of "convert, then (re)move the source CSV files"?
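
To make the question concrete, here is a rough sketch of the flow I have in mind. It is only an illustration, not something I am running: the directory layout, the function name `convert_batch`, and the `convert` hook (which would submit the CTAS to Drill) are all hypothetical, and it assumes every directory sits on one filesystem where a rename is atomic.

```python
from pathlib import Path
from typing import Callable, List


def convert_batch(
    csv_dir: Path,
    staging_dir: Path,
    parquet_dir: Path,
    archive_dir: Path,
    convert: Callable[[List[Path], Path], None],
) -> List[Path]:
    """Convert a snapshot of CSV files, publish the result, retire the sources.

    Returns the archived CSV paths. All names here are hypothetical.
    """
    # Snapshot the files present now; later arrivals wait for the next batch.
    batch = sorted(csv_dir.glob("*.csv"))
    if not batch:
        return []

    # Write the Parquet output to a location the view does not read.
    staging_dir.mkdir(parents=True, exist_ok=True)
    out = staging_dir / f"batch_{batch[0].stem}"
    convert(batch, out)  # hypothetical hook, e.g. run a CTAS through Drill

    # Publish the Parquet directory, then move the sources out of the view.
    # Each rename is atomic on its own, but the pair is not: between the two
    # steps the same rows are visible from both the CSV and Parquet sides.
    parquet_dir.mkdir(parents=True, exist_ok=True)
    archive_dir.mkdir(parents=True, exist_ok=True)
    out.rename(parquet_dir / out.name)

    archived = []
    for src in batch:
        dest = archive_dir / src.name
        src.rename(dest)
        archived.append(dest)
    return archived
```

The gap between publishing the Parquet directory and moving the CSV files is exactly the part I do not know how to close. Swapping the order only trades the duplicate window for a window where the data is missing.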
