We create a JSON-format schema for the Parquet file using the Avro specification, and use this schema when loading data.
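For concreteness, here is a minimal sketch of that write path using parquet-avro's builder API; the schema fields, output path, and compression codec below are illustrative assumptions, not our actual schema:

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.metadata.CompressionCodecName;

    public class ParquetWriteSketch {
        public static void main(String[] args) throws Exception {
            // The Avro schema, expressed as JSON (fields here are assumptions).
            String schemaJson =
                "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
              + "{\"name\":\"ts\",\"type\":\"long\"},"
              + "{\"name\":\"device\",\"type\":\"string\"},"
              + "{\"name\":\"reading\",\"type\":\"double\"}]}";
            Schema schema = new Schema.Parser().parse(schemaJson);

            // Write one file directly into the table's directory on HDFS.
            ParquetWriter<GenericRecord> writer = AvroParquetWriter
                .<GenericRecord>builder(new Path("hdfs:///data/events/part-00000.parquet"))
                .withSchema(schema)
                .withCompressionCodec(CompressionCodecName.SNAPPY)
                .build();

            GenericRecord rec = new GenericData.Record(schema);
            rec.put("ts", System.currentTimeMillis());
            rec.put("device", "sensor-01");
            rec.put("reading", 42.0);
            writer.write(rec);
            writer.close();
        }
    }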
Is there anything special we have to do to flag a column as a partitioning column? Sorry, I don’t understand your answer. What do you mean by ‘discover the columns with a single value’?

Cheers — Chris

> On 21 Oct 2015, at 20:02, Mehant Baid <[email protected]> wrote:
>
> The information is stored in the footer of the parquet files. Drill reads the
> metadata stored in the parquet footer to discover the columns with a single
> value and treats them as partitioning columns.
>
> Thanks
> Mehant
>
> On 10/21/15 11:52 AM, Chris Mathews wrote:
>> Thanks Mehant; yes, we did look at doing this, but the advantage of using the
>> new PARTITION BY feature is that the partitioned columns are automatically
>> detected during any subsequent queries. This is a major advantage as our
>> customers are using the Tableau BI tool, and knowing details such as the
>> exact partition levels and directories is not an option.
>>
>> By the way, having created a table using PARTITION BY and CTAS, how does a
>> query know how to action the pruning? Where is this information stored for
>> the query to access the tables/files efficiently?
>>
>> Cheers — Chris
>>
>>> On 21 Oct 2015, at 19:37, Mehant Baid <[email protected]> wrote:
>>>
>>> In addition to the auto-partitioning done by CTAS, Drill also supports
>>> directory-based pruning. You could load data into different (nested)
>>> directories underneath the top-level table location and use the 'where'
>>> clause to get the pruning performance benefits. The following is a typical
>>> example.
>>>
>>> Table location: /home/user/table_name
>>>
>>> Within this you could create a nested directory structure of the form:
>>>
>>> /home/user/table_name/2010/jan
>>> /home/user/table_name/2010/feb
>>> ...
>>> /home/user/table_name/2010/dec
>>>
>>> /home/user/table_name/2011/jan
>>> ...
>>> /home/user/table_name/2011/dec
>>>
>>> Given this directory structure you could run a query such as:
>>>
>>> select col1 from dfs.`/home/user/table_name` where dir0 = 2011 and dir1 = 'jan';
>>>
>>> This would prune the scan, skipping the parquet files under all the other
>>> directories.
>>>
>>> Thanks
>>> Mehant
>>>
>>> On 10/21/15 11:26 AM, Chris Mathews wrote:
>>>> We have an existing ETL framework processing machine-generated data, which
>>>> we are updating to write Parquet files out directly to HDFS using
>>>> AvroParquetWriter for access by Drill.
>>>>
>>>> Some questions:
>>>>
>>>> How do we take advantage of Drill’s partition-pruning capabilities with
>>>> PARTITION BY if we are not using CTAS to load the Parquet files?
>>>>
>>>> It seems there is no way of taking advantage of these features if the
>>>> Parquet files are created externally to CTAS - am I correct?
>>>>
>>>> If I am, then is there any way, using a Drill API, of programmatically
>>>> loading our data into Parquet files and utilising Drill's parallelisation
>>>> techniques via CTAS, or do we have to write the data out to a file and
>>>> then load that file again as input to a CTAS command?
>>>>
>>>> Another potential issue is that we are constantly writing Parquet files
>>>> out to HDFS directories, so the data in these files eventually appears as
>>>> additional data in a Drill query - so how can we do this with CTAS? Does
>>>> CTAS append to an existing directory structure, or does it insist on a new
>>>> table name each time it is executed?
>>>>
>>>> What I am getting at here is that there seem to be performance-enhancement
>>>> features available to Drill when the Parquet files are created using an
>>>> existing file as input to a CTAS that are not possible otherwise. With the
>>>> volumes of data we are talking about, it is not really an option to write
>>>> the files out only for them then to be read back in again for conversion
>>>> using CTAS; which is why we write the Parquet files out directly to HDFS
>>>> and append them to existing directories.
>>>>
>>>> Am I missing something obvious here? Quite possibly, yes.
>>>>
>>>> Thanks for any help.
>>>>
>>>> Cheers — Chris
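To make ‘discover the columns with a single value’ concrete: CTAS with PARTITION BY lays out the output so that, within any one file, the partitioning column holds a single value, and that fact is visible in the min/max statistics of each file's footer. A sketch of inspecting those statistics with the parquet-hadoop API (the path is an assumption):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.format.converter.ParquetMetadataConverter;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.metadata.BlockMetaData;
    import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
    import org.apache.parquet.hadoop.metadata.ParquetMetadata;

    public class FooterStatsSketch {
        public static void main(String[] args) throws Exception {
            // Read only the footer of one file (path is an assumption).
            ParquetMetadata footer = ParquetFileReader.readFooter(
                new Configuration(),
                new Path("hdfs:///data/events/part-00000.parquet"),
                ParquetMetadataConverter.NO_FILTER);

            // A column whose min equals its max in every row group of the
            // file holds a single value in that file; those are the columns
            // treated as partitioning columns.
            for (BlockMetaData rowGroup : footer.getBlocks()) {
                for (ColumnChunkMetaData col : rowGroup.getColumns()) {
                    System.out.println(col.getPath()
                        + " min=" + col.getStatistics().genericGetMin()
                        + " max=" + col.getStatistics().genericGetMax());
                }
            }
        }
    }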
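On the question of driving CTAS programmatically: there is no dedicated loading API, but a CTAS with PARTITION BY can be issued through Drill's JDBC driver. A sketch, assuming drill-jdbc-all is on the classpath; the ZooKeeper address, workspace, and table/column names are all made up for illustration:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class CtasSketch {
        public static void main(String[] args) throws Exception {
            try (Connection conn =
                     DriverManager.getConnection("jdbc:drill:zk=zkhost:2181");
                 Statement stmt = conn.createStatement()) {
                // CTAS creates a new table directory and does not append to an
                // existing one, so continuous loads need a fresh table name
                // per batch (e.g. one per day).
                stmt.execute(
                    "CREATE TABLE dfs.tmp.`events_2015_10_21` "
                  + "PARTITION BY (device) AS "
                  + "SELECT ts, device, reading "
                  + "FROM dfs.`/staging/events_2015_10_21`");
            }
        }
    }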
