Re: "Transactional" conversion of CSV to Parquet?

2016-10-24 Thread MattK

> All of this is trivial on a conventional file system or on MapR. Don't
> think it works out of the box on HDFS (but am willing to be corrected).


I did not mention that I am using MapR-FS, so hard and symbolic links are an option.

On 24 Oct 2016, at 17:34, Ted Dunning wrote:


> Yeah... it is quite doable. It helps a bit to have hard links.
>
> The basic idea is to have one symbolic link that points to either of two
> ping-pong staging directories. Whichever staging directory the symbolic
> link points to is called the active staging directory; the other is called
> inactive.
>
> To insert new CSV data, move data into the inactive staging directory. Then
> create a hard link to the same file in the active staging directory. The
> new data will appear atomically.
>
> To consolidate old data to Parquet, do the conversion in the inactive
> staging directory. After the conversion succeeds, delete the CSV file from
> the inactive directory. Because the active directory has a hard link to the
> CSV file, it won't vanish from there. Then flip the symbolic link to point
> to the (old) inactive directory. This makes the two operations of the CSV
> disappearing and the corresponding Parquet file appearing happen in an
> atomic moment.
>
> The keys here are the hard links and the ping-ponging of staging
> directories. If you have just one staging directory, you won't have
> atomic deletion of the CSV file and creation of the Parquet file.
>
> Another subtlety here is the use of a symbolic link to point to the active
> directory. This means that whenever you read the directory's contents, you
> should get one staging directory or the other, and thus a scan will give
> you *either* CSV or Parquet but *not* both.
>
> All of this is trivial on a conventional file system or on MapR. Don't
> think it works out of the box on HDFS (but am willing to be corrected).


> On Mon, Oct 24, 2016 at 1:49 PM, MattK wrote:
>
>> I have a cluster that receives log files in CSV format on a per-minute
>> basis, and those files are immediately available to Drill users. For
>> performance I create Parquet files from them in batch using CTAS commands.
>>
>> I would like to script a process that makes the Parquet files available on
>> creation, perhaps through a UNION view, but that does not serve duplicate
>> data through both an original CSV and a converted Parquet file at the same
>> time.
>>
>> Is there a common practice for making data available once converted,
>> something similar to a transactional batch of "convert then (re)move
>> source CSV files"?



Re: "Transactional" conversion of CSV to Parquet?

2016-10-24 Thread Ted Dunning
Yeah... it is quite doable. It helps a bit to have hard links.

The basic idea is to have one symbolic link that points to either of two
ping-pong staging directories. Whichever staging directory the symbolic
link points to is called the active staging directory; the other is called
inactive.
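Concretely, the initial layout might look like this (a minimal sketch; the names `stage-a`, `stage-b`, and `active` are illustrative, not from the thread):

```shell
# Work in a scratch directory so the sketch is self-contained.
cd "$(mktemp -d)"

# Two ping-pong staging directories plus one symlink naming the
# active one. Readers (e.g. Drill) only ever look through "active".
mkdir -p stage-a stage-b
ln -s stage-a active      # stage-a is active; stage-b is inactive
readlink active           # prints: stage-a
```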

To insert new CSV data, move data into the inactive staging directory. Then
create a hard link to the same file in the active staging directory. The
new data will appear atomically.
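That insert step could be sketched as follows (assuming an illustrative layout where `stage-a` is active and `stage-b` inactive):

```shell
# Self-contained setup: stage-a is active, stage-b is inactive.
cd "$(mktemp -d)"
mkdir -p stage-a stage-b
ln -s stage-a active
printf 'ts,msg\n' > incoming.csv

# 1. Move the new file into the INACTIVE staging directory.
mv incoming.csv stage-b/

# 2. Hard-link it into the ACTIVE directory; link(2) makes the
#    file visible to readers in one atomic step.
ln stage-b/incoming.csv stage-a/incoming.csv
```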

To consolidate old data to Parquet, do the conversion in the inactive
staging directory. After the conversion succeeds, delete the CSV file from
the inactive directory. Because the active directory has a hard link to the
CSV file, it won't vanish from there. Then flip the symbolic link to point
to the (old) inactive directory. This makes the two operations of the CSV
disappearing and the corresponding Parquet file appearing happen in an
atomic moment.
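A sketch of that consolidate-and-flip step (the `cp` merely stands in for the real CSV-to-Parquet conversion, e.g. a Drill CTAS; names are illustrative):

```shell
# Self-contained setup: one csv already staged in both directories.
cd "$(mktemp -d)"
mkdir -p stage-a stage-b
ln -s stage-a active
printf 'a,1\n' > stage-b/old.csv
ln stage-b/old.csv stage-a/old.csv

# Convert in the INACTIVE directory (placeholder for the real CTAS).
cp stage-b/old.csv stage-b/old.parquet

# Delete the csv on the inactive side; the active side's hard link
# keeps it visible to readers for now.
rm stage-b/old.csv

# Flip: build the new symlink aside, then rename it over the old one.
# rename(2) is atomic, so readers see the csv vanish and the parquet
# appear in the same instant. (-T is GNU mv; on BSD/macOS use mv -h.)
ln -s stage-b active.tmp
mv -T active.tmp active
```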

The keys here are the hard links and the ping-ponging of staging
directories. If you have just one staging directory, you won't have
atomic deletion of the CSV file and creation of the Parquet file.

Another subtlety here is the use of a symbolic link to point to the active
directory. This means that whenever you read the directory's contents, you
should get one staging directory or the other, and thus a scan will give
you *either* CSV or Parquet but *not* both.

All of this is trivial on a conventional file system or on MapR. Don't
think it works out of the box on HDFS (but am willing to be corrected).


On Mon, Oct 24, 2016 at 1:49 PM, MattK  wrote:

> I have a cluster that receives log files in CSV format on a per-minute
> basis, and those files are immediately available to Drill users. For
> performance I create Parquet files from them in batch using CTAS commands.
>
> I would like to script a process that makes the Parquet files available on
> creation, perhaps through a UNION view, but that does not serve duplicate
> data through both an original CSV and a converted Parquet file at the same
> time.
>
> Is there a common practice for making data available once converted,
> something similar to a transactional batch of "convert then (re)move source
> CSV files"?
>


Re: "Transactional" conversion of CSV to Parquet?

2016-10-24 Thread Jim Scott
I would think that the best way to accommodate this would be:
when landing the CSV, place it into folder A, then convert it to Parquet
format and put the result in folder B.

This will give you isolation between the file formats, and you can then
choose to only query the parquet files. This is the simplest way to
guarantee that you don't double count.
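As a sketch of that layout (folder names are illustrative, and the `cp` stands in for the real CTAS conversion):

```shell
cd "$(mktemp -d)"
mkdir -p folderA folderB

# CSV lands in folder A as it arrives.
printf 'a,1\n' > folderA/log-0001.csv

# A batch job converts into folder B (placeholder for the real CTAS).
cp folderA/log-0001.csv folderB/log-0001.parquet

# Queries point only at folderB, so a file is never served twice,
# at the cost of not seeing rows until they have been converted.
ls folderB    # prints: log-0001.parquet
```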

Alternatively, you could probably get fancy by leveraging the newer
feature that exposes the filename as part of the query results:
https://issues.apache.org/jira/browse/DRILL-3474

I can't really begin to think of the explicit SQL syntax you would need in
order to get inputfile.csv and not inputfile.parquet, etc... I would think
it would be rather complicated.



On Mon, Oct 24, 2016 at 3:49 PM, MattK  wrote:

> I have a cluster that receives log files in CSV format on a per-minute
> basis, and those files are immediately available to Drill users. For
> performance I create Parquet files from them in batch using CTAS commands.
>
> I would like to script a process that makes the Parquet files available on
> creation, perhaps through a UNION view, but that does not serve duplicate
> data through both an original CSV and a converted Parquet file at the same
> time.
>
> Is there a common practice for making data available once converted,
> something similar to a transactional batch of "convert then (re)move source
> CSV files"?
>



-- 
*Jim Scott*
Director, Enterprise Strategy & Architecture
+1 (347) 746-9281
@kingmesal 

