So, the data is in HDFS, and from there I need to transform it into a structure that is appropriate for visualization, which I do with a set of views.
Every day I want to pick up the newest files and pack them into Parquet files so that our BI tool runs a bit snappier; running everything off JSON files accessed through views is too slow. While new data is being inserted (a process over which I have no control), I UNION ALL the previous days' data with the current day's data. Normally I would write an INSERT INTO that runs every night and picks up only the recent data, but since that is not supported I'm curious how else I can solve this.

As I said, a CTAS over all the existing data plus the latest additions is not viable performance-wise, quite apart from being an ugly solution. Similarly, doing a CTAS into a dummy table followed by a copy to the right directory is a hack that I don't consider acceptable for production purposes. Any thoughts are greatly appreciated!
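PS: to make the idea concrete, here is roughly what I have in mind, written as a sketch in a Drill-style dialect where a filesystem path can act as a table. All workspace names, paths, and dates below are made up for illustration; this is the shape of the workflow, not working production code:

```sql
-- Nightly job: convert ONLY yesterday's JSON into its own Parquet
-- directory, so no previously converted data is ever rewritten.
CREATE TABLE dfs.parquet.`events/2015-06-01` AS
SELECT *
FROM dfs.json.`events/2015-06-01`;

-- A view then stitches the converted Parquet history together with the
-- current, still-growing day's JSON, so the BI tool sees one table.
CREATE OR REPLACE VIEW dfs.views.events AS
SELECT * FROM dfs.parquet.`events`           -- all converted days
UNION ALL
SELECT * FROM dfs.json.`events/2015-06-02`;  -- today's unconverted data
```

The view would have to be re-created (or parameterized somehow) each night to point at the new "current day" directory, which is the part I'm not happy about either, so better ideas are welcome.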
