Yeah... it is quite doable. It helps a bit to have hard links. The basic idea is to have one symbolic link that points to either of two ping-pong staging directories. Whichever staging directory the symbolic link points to is called the active staging directory; the other is the inactive one.
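
Here's roughly what that looks like, as a minimal Python sketch. It assumes a local POSIX filesystem; the directory names, the path readers scan, and the convert_to_parquet callable are all just illustrative, and the housekeeping needed to re-sync the two directories after a flip is left out.

import os
import shutil

BASE = "/data/logs"                        # illustrative root that readers query
STAGE_A = os.path.join(BASE, "stage_a")    # ping
STAGE_B = os.path.join(BASE, "stage_b")    # pong
ACTIVE = os.path.join(BASE, "active")      # the symlink readers actually scan

def setup():
    # Create the two ping-pong staging directories and point the
    # "active" symlink at one of them to start.
    os.makedirs(STAGE_A, exist_ok=True)
    os.makedirs(STAGE_B, exist_ok=True)
    if not os.path.islink(ACTIVE):
        os.symlink(STAGE_A, ACTIVE)

def active_and_inactive():
    # Resolve which staging directory the symlink currently points at.
    active = os.path.realpath(ACTIVE)
    inactive = STAGE_B if active == os.path.realpath(STAGE_A) else STAGE_A
    return active, inactive

def flip_active(new_target):
    # A symlink can't be retargeted in place, so make a temporary link
    # and rename() it over the old one; rename is atomic on POSIX.
    tmp = ACTIVE + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(new_target, tmp)
    os.replace(tmp, ACTIVE)

def ingest_csv(src_csv):
    # New CSV data: move it into the inactive directory, then hard-link
    # it into the active one so readers see it appear atomically.
    active, inactive = active_and_inactive()
    name = os.path.basename(src_csv)
    staged = os.path.join(inactive, name)
    shutil.move(src_csv, staged)
    os.link(staged, os.path.join(active, name))

def consolidate(csv_name, convert_to_parquet):
    # Convert the CSV in the inactive directory, drop the CSV copy there
    # (the active directory still holds a hard link, so readers keep
    # seeing it), then flip the symlink: the CSV vanishes and the
    # Parquet file appears in the same instant.
    active, inactive = active_and_inactive()
    csv_path = os.path.join(inactive, csv_name)
    convert_to_parquet(csv_path, inactive)   # e.g. run a CTAS and land the output here
    os.remove(csv_path)
    flip_active(inactive)

The prose version of the same steps follows.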
To insert new CSV data, move the data into the inactive staging directory, then create a hard link to the same file in the active staging directory. The new data appears atomically.

To consolidate old data to Parquet, do the conversion in the inactive staging directory. After the conversion succeeds, delete the CSV file from the inactive directory; because the active directory has a hard link to the CSV file, it won't vanish from there. Then flip the symbolic link to point at the (old) inactive directory. That makes the two events, the CSV disappearing and the corresponding Parquet file appearing, happen in a single atomic moment.

The keys here are the hard links and the ping-ponging of staging directories. If you have just one staging directory, the deletion of the CSV file and the creation of the Parquet file can't be atomic. Another subtlety is the use of a symbolic link to point at the active directory: whenever you read the directory contents you get one staging directory or the other, so a scan will give you *either* CSV or Parquet but *not* both.

All of this is trivial on a conventional file system or on MapR. I don't think it works out of the box on HDFS (but am willing to be corrected).

On Mon, Oct 24, 2016 at 1:49 PM, MattK <[email protected]> wrote:

> I have a cluster that receives log files in a csv format on a per-minute
> basis, and those files are immediately available to Drill users. For
> performance I create Parquet files from them in batch using CTAS commands.
>
> I would like to script a process that makes the Parquet files available on
> creation, perhaps through a UNION view, but that does not serve duplicate
> data through both an original csv and converted Parquet file at the same
> time.
>
> Is there a common practice to making data available once converted, in
> something similar to a transactional batch of "convert then (re)move source
> csv files" ?
>
