Hi,

Thanks for details.
It's the point, I don't want to write additional metadata,
but just organize the parquet file to have more useful stats.

In a simple GROUP BY it's possible to not SELECT some of "grouped" column.
(Example SELECT a, b FROM ... GROUP BY a, b, c;)
In the same way, I think it will be useful to have the possibilities
to PARTITION BY and say (after the PARTITION BY) which columns put in
the parquet file.

I can imagine, that it will be difficult to add a "SELECT" close in
PARTITION BY clauses like :
[... PARTITION BY (a (SELECT ONLY b, c)) AS SELECT a, b, c FROM ...]

Intermediately, allow the possibilities to "PARTITION BY" a computation
on an existing column seem more accessible and should be useful.
example :
[... PARTITION BY (a < 10) AS SELECT a, b, c FROM ...]

Because it's useless to store the "a < 10" but it may be useful to filer
half of the parquet when request with condition on column a.

But maybe there is a fundamental reason to not allow these facilities.

Regards,
Benj

Le 06/12/2018 à 16:42, Anton Gozhiy a écrit :
> Hi Benj,
> 
> Creating partitions as in your first example won't work.
>>From the docs: "During partitioning, Drill creates separate files, but not
> separate directories, for different partitions." (
> https://drill.apache.org/docs/how-to-partition-data/).
> Also, Drill doesn't write additional metadata regarding partitioning, when
> it reads parquet files it determines partitions using min/max values.
> That means that if you want for example to partition using the first
> letter, you'll need to create a corresponding column. Or you can create
> partitions manually as directories.
> 
> On Wed, Dec 5, 2018 at 10:07 PM <[email protected]> wrote:
> 
>> In would like to create a parquet with a partition on computed data
>> (without to have to put the result of the computation in the parquet) :
>> The goal is to optimize the parquet for typical expecting queries.
>>
>> Imaginary example :
>> CREATE TABLE `mytable`
>> PARTITION BY (substr(name,1,1)) AS
>> SELECT name, birthdate, birthcity
>> ORDER BY bithdate;
>>
>> So, if I do that I obtain a VALIDATION ERROR: Partition column ... is not
>> in the SELECT list of CTAS
>>
>> And the comment of the code of the function
>> "public static RelNode qualifyPartitionCol(RelNode input, List<String>
>> partitionColumns)"
>> confirms that it's not possible actually :
>> " A partition column is resolved, either (1) the same column appear in the
>> select list of CTAS or (2) CTAS has a * in select list"
>>
>> But what is the reason of this limitation ?
>> Is there exists any tricks to do it right now, or can we expect an
>> evolution to allow this possibilities.
>>
>> I just imagine to do (with the data of the example)
>> CREATE TABLE `mytable`
>> PARTITION BY (sname) AS
>> SELECT substr(name,1,1) sname, name, birthdate, birthcity
>> ORDER BY bithdate;
>> Then, next, request each partition file to remove the useless data
>> , like
>> CREATE TABLE `mytable_2/partition_x`
>> SELECT name, birthdate, birthcity
>> ORDER BY bithdate;
>> but it's not really satisfying...
>>
>> I would appreciate yours comments,
>> Regards,
>>
>> benj
>>
> 
> 

Reply via email to