Hi, Thanks for details. It's the point, I don't want to write additional metadata, but just organize the parquet file to have more useful stats.
In a simple GROUP BY it's possible to not SELECT some of "grouped" column. (Example SELECT a, b FROM ... GROUP BY a, b, c;) In the same way, I think it will be useful to have the possibilities to PARTITION BY and say (after the PARTITION BY) which columns put in the parquet file. I can imagine, that it will be difficult to add a "SELECT" close in PARTITION BY clauses like : [... PARTITION BY (a (SELECT ONLY b, c)) AS SELECT a, b, c FROM ...] Intermediately, allow the possibilities to "PARTITION BY" a computation on an existing column seem more accessible and should be useful. example : [... PARTITION BY (a < 10) AS SELECT a, b, c FROM ...] Because it's useless to store the "a < 10" but it may be useful to filer half of the parquet when request with condition on column a. But maybe there is a fundamental reason to not allow these facilities. Regards, Benj Le 06/12/2018 à 16:42, Anton Gozhiy a écrit : > Hi Benj, > > Creating partitions as in your first example won't work. >>From the docs: "During partitioning, Drill creates separate files, but not > separate directories, for different partitions." ( > https://drill.apache.org/docs/how-to-partition-data/). > Also, Drill doesn't write additional metadata regarding partitioning, when > it reads parquet files it determines partitions using min/max values. > That means that if you want for example to partition using the first > letter, you'll need to create a corresponding column. Or you can create > partitions manually as directories. > > On Wed, Dec 5, 2018 at 10:07 PM <[email protected]> wrote: > >> In would like to create a parquet with a partition on computed data >> (without to have to put the result of the computation in the parquet) : >> The goal is to optimize the parquet for typical expecting queries. >> >> Imaginary example : >> CREATE TABLE `mytable` >> PARTITION BY (substr(name,1,1)) AS >> SELECT name, birthdate, birthcity >> ORDER BY bithdate; >> >> So, if I do that I obtain a VALIDATION ERROR: Partition column ... is not >> in the SELECT list of CTAS >> >> And the comment of the code of the function >> "public static RelNode qualifyPartitionCol(RelNode input, List<String> >> partitionColumns)" >> confirms that it's not possible actually : >> " A partition column is resolved, either (1) the same column appear in the >> select list of CTAS or (2) CTAS has a * in select list" >> >> But what is the reason of this limitation ? >> Is there exists any tricks to do it right now, or can we expect an >> evolution to allow this possibilities. >> >> I just imagine to do (with the data of the example) >> CREATE TABLE `mytable` >> PARTITION BY (sname) AS >> SELECT substr(name,1,1) sname, name, birthdate, birthcity >> ORDER BY bithdate; >> Then, next, request each partition file to remove the useless data >> , like >> CREATE TABLE `mytable_2/partition_x` >> SELECT name, birthdate, birthcity >> ORDER BY bithdate; >> but it's not really satisfying... >> >> I would appreciate yours comments, >> Regards, >> >> benj >> > >
