All of this is trivial on a conventional file system or on MapR. I don't
think it works out of the box on HDFS (but I am willing to be corrected).

I did not mention that I am using MapR-FS, so links are an option.

On 24 Oct 2016, at 17:34, Ted Dunning wrote:

Yeah... it is quite doable. It helps a bit to have hard links.

The basic idea is to have one symbolic link that points to either of two
ping-pong staging directories. Whichever staging directory the symbolic
link points to is called the active staging directory; the other is called
inactive.
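A minimal sketch of that layout, with directory names I've made up for
illustration (stage_a, stage_b, current):

```shell
# Two ping-pong staging directories (names assumed, not from the thread)
mkdir -p stage_a stage_b
# Readers always scan through the "current" symlink; stage_a starts active.
# -f replaces an existing link, -n keeps the symlink itself from being
# dereferenced as a directory.
ln -sfn stage_a current
```

Queries would then point at `current`, never at the staging directories
directly.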

To insert new CSV data, move data into the inactive staging directory. Then create a hard link to the same file in the active staging directory. The
new data will appear atomically.
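The insert step might look like this, assuming stage_a is currently active
and stage_b inactive (file and directory names are illustrative):

```shell
# Setup for the sketch: staging dirs plus a newly arrived csv file
mkdir -p incoming stage_a stage_b
echo "ts,level,msg" > incoming/minute-0001.csv   # stand-in for a real log

# 1. Move the new file into the *inactive* staging directory
mv incoming/minute-0001.csv stage_b/

# 2. Hard-link the same file into the active directory: readers scanning
#    the active directory see it appear atomically
ln stage_b/minute-0001.csv stage_a/minute-0001.csv
```

Because both names refer to one inode, deleting either copy later leaves
the other intact.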

To consolidate old data to parquet, do the conversion in the inactive
staging directory. After the conversion succeeds, delete the csv file from
the inactive directory. Because the active directory has a hard link to the
csv file, it won't vanish from there. Then flip the symbolic link to point
to the (old) inactive directory. This makes the csv's disappearance and the
corresponding parquet file's appearance happen in a single atomic moment.
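The consolidation and flip could be sketched like this (names assumed as
before; the real conversion would be a Drill CTAS, represented here by a
placeholder copy):

```shell
# Setup mirroring the insert step above (illustrative names)
mkdir -p stage_a stage_b
echo "ts,level,msg" > stage_b/minute-0001.csv
ln stage_b/minute-0001.csv stage_a/minute-0001.csv
ln -sfn stage_a current

# 1. Convert the csv in the *inactive* directory to parquet.
#    A copy stands in for the actual CTAS conversion here.
cp stage_b/minute-0001.csv stage_b/minute-0001.parquet

# 2. Delete the csv from the inactive directory; the active directory's
#    hard link keeps the file alive for current readers.
rm stage_b/minute-0001.csv

# 3. Flip the symlink: stage_b (csv gone, parquet present) becomes active.
#    Readers see csv replaced by parquet in one atomic step.
ln -sfn stage_b current
```

After the flip the old active directory can be cleaned up at leisure, once
any in-flight scans of it have finished.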

The keys here are the hard links and the ping-ponging of staging
directories. If you just have a staging directory, then you won't have
atomic deletion of the csv file and creation of the parquet file.

Another subtlety here is the use of a symbolic link to point to the active
directory. This means that whenever you read the directory's contents, you
should get one staging directory or the other, and thus a scan will give you
*either* csv or parquet but *not* both.

All of this is trivial on a conventional file system or on MapR. I don't
think it works out of the box on HDFS (but I am willing to be corrected).

On Mon, Oct 24, 2016 at 1:49 PM, MattK <[email protected]> wrote:

I have a cluster that receives log files in a csv format on a per-minute
basis, and those files are immediately available to Drill users. For
performance I create Parquet files from them in batch using CTAS commands.

I would like to script a process that makes the Parquet files available on
creation, perhaps through a UNION view, but that does not serve duplicate
data through both an original csv and its converted Parquet file at the
same time.

Is there a common practice for making data available once converted,
something similar to a transactional batch of "convert, then (re)move
source csv files"?
