We have an existing ETL framework processing machine-generated data, which we are updating to write Parquet files directly to HDFS using AvroParquetWriter, for access by Drill.
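For context, our writer logic is roughly along these lines (the schema, paths, and field names here are simplified placeholders, not our real ones):

    // Minimal sketch of writing Parquet directly to HDFS via AvroParquetWriter.
    // Schema, path, and field names are illustrative placeholders.
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.metadata.CompressionCodecName;

    public class DirectParquetWriter {
        public static void main(String[] args) throws Exception {
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
                + "{\"name\":\"ts\",\"type\":\"long\"},"
                + "{\"name\":\"value\",\"type\":\"double\"}]}");

            // Each batch becomes a new file under an existing directory,
            // so the rows show up in subsequent Drill queries on that path.
            Path file = new Path("hdfs://namenode/data/events/2016/01/part-0001.parquet");

            try (ParquetWriter<GenericRecord> writer =
                     AvroParquetWriter.<GenericRecord>builder(file)
                         .withSchema(schema)
                         .withConf(new Configuration())
                         .withCompressionCodec(CompressionCodecName.SNAPPY)
                         .build()) {
                GenericRecord rec = new GenericData.Record(schema);
                rec.put("ts", System.currentTimeMillis());
                rec.put("value", 42.0);
                writer.write(rec);
            }
        }
    }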
Some questions:

1. How do we take advantage of Drill's partition pruning capabilities with PARTITION BY if we are not using CTAS to load the Parquet files? (The exact CTAS form I mean is in the P.S. below.) It seems there is no way of taking advantage of these features if the Parquet files are created externally to CTAS - am I correct?

2. If so, is there any way, using a Drill API, of programmatically loading our data into Parquet files and utilising Drill's parallelisation via CTAS, or do we have to write the data out to a file and then read that file back in as input to a CTAS command?

3. We are constantly writing Parquet files out to HDFS directories, so the data in those files eventually appears as additional data in a Drill query. How can we achieve this with CTAS? Does CTAS append to an existing directory structure, or does it insist on a new table name each time it is executed?

What I am getting at is that there seem to be performance features available to Drill when the Parquet files are created from an existing file via CTAS that are not possible otherwise. With the volumes of data we are talking about, it is not really an option to write the files out only for them to be read back in again for conversion using CTAS; that is why we write the Parquet files directly to HDFS and append them to existing directories, as in the sketch above.

Am I missing something obvious here - quite possibly yes? Thanks for any help.

Cheers — Chris
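P.S. For reference, this is the CTAS form I am asking about (the workspace, table, and column names are placeholders, not our real ones):

    CREATE TABLE dfs.tmp.`events_by_date`
    PARTITION BY (event_date)
    AS SELECT event_date, ts, value
    FROM dfs.`/staging/events.parquet`;

i.e. Drill reads an existing file and writes the partitioned Parquet itself, which is exactly the write-then-read-back step we are trying to avoid.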
