So, the data lands in HDFS, and from there I need to transform it into a
structure that is suitable for visualization, which I do with a set of views.

Every day I want to pick up the newest files and pack them into Parquet
files, so that our BI tool runs a bit snappier; running everything off
JSON files accessed through views is too slow. While new data is being
inserted (a process over which I have no control), I UNION ALL the
previous days' Parquet data with the current day's JSON data.
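To make the setup concrete, here is a sketch of the kind of view I mean;
the schema, column, and path names are made up for illustration:

```sql
-- Hypothetical view: compacted history in Parquet, unioned with
-- today's still-growing raw JSON. Names are placeholders.
CREATE OR REPLACE VIEW dfs.views.`events` AS
SELECT event_time, user_id, payload
FROM dfs.archive.`/data/events_parquet`        -- previous days, Parquet
UNION ALL
SELECT event_time, user_id, payload
FROM dfs.raw.`/data/events_json/current_day`;  -- current day, JSON
```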

I would normally write an INSERT INTO job that runs every night and picks
up only the recent data. Since this is not supported, I'm curious how else
I can solve this. As I said, a CTAS over all the existing data plus the
latest additions is not viable performance-wise, quite apart from being an
ugly solution. Similarly, doing a CTAS into a dummy table and then copying
the files to the right directory is a hack that I don't consider
acceptable for production purposes.
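For clarity, this is roughly what I would like to run nightly versus the
workaround I want to avoid; again, all table and path names here are
hypothetical:

```sql
-- What I would like to schedule nightly (not supported):
INSERT INTO dfs.archive.`/data/events_parquet`
SELECT event_time, user_id, payload
FROM dfs.raw.`/data/events_json/current_day`;

-- The hack I'd rather not put in production: CTAS into a scratch
-- table, then move the resulting Parquet files into place by hand.
CREATE TABLE dfs.tmp.`events_staging` AS
SELECT event_time, user_id, payload
FROM dfs.raw.`/data/events_json/current_day`;
-- followed by something like:
--   hdfs dfs -mv /tmp/events_staging/* /data/events_parquet/
```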

Any thoughts are greatly appreciated!
