Benj, I meant that without the metadata Drill won't recognize files as partitions. Although that should not be a problem for the optimization mechanisms.
As for your original question, I think that it's rather not implemented than an intended limitation. Feel free to submit a feature request to Apache JIRA. On Fri, Dec 7, 2018 at 9:20 PM benj.dev <[email protected]> wrote: > Hi, > > Thanks for details. > It's the point, I don't want to write additional metadata, > but just organize the parquet file to have more useful stats. > > In a simple GROUP BY it's possible to not SELECT some of "grouped" column. > (Example SELECT a, b FROM ... GROUP BY a, b, c;) > In the same way, I think it will be useful to have the possibilities > to PARTITION BY and say (after the PARTITION BY) which columns put in > the parquet file. > > I can imagine, that it will be difficult to add a "SELECT" close in > PARTITION BY clauses like : > [... PARTITION BY (a (SELECT ONLY b, c)) AS SELECT a, b, c FROM ...] > > Intermediately, allow the possibilities to "PARTITION BY" a computation > on an existing column seem more accessible and should be useful. > example : > [... PARTITION BY (a < 10) AS SELECT a, b, c FROM ...] > > Because it's useless to store the "a < 10" but it may be useful to filer > half of the parquet when request with condition on column a. > > But maybe there is a fundamental reason to not allow these facilities. > > Regards, > Benj > > Le 06/12/2018 à 16:42, Anton Gozhiy a écrit : > > Hi Benj, > > > > Creating partitions as in your first example won't work. > >>From the docs: "During partitioning, Drill creates separate files, but > not > > separate directories, for different partitions." ( > > https://drill.apache.org/docs/how-to-partition-data/). > > Also, Drill doesn't write additional metadata regarding partitioning, > when > > it reads parquet files it determines partitions using min/max values. > > That means that if you want for example to partition using the first > > letter, you'll need to create a corresponding column. Or you can create > > partitions manually as directories. > > > > On Wed, Dec 5, 2018 at 10:07 PM <[email protected]> wrote: > > > >> In would like to create a parquet with a partition on computed data > >> (without to have to put the result of the computation in the parquet) : > >> The goal is to optimize the parquet for typical expecting queries. > >> > >> Imaginary example : > >> CREATE TABLE `mytable` > >> PARTITION BY (substr(name,1,1)) AS > >> SELECT name, birthdate, birthcity > >> ORDER BY bithdate; > >> > >> So, if I do that I obtain a VALIDATION ERROR: Partition column ... is > not > >> in the SELECT list of CTAS > >> > >> And the comment of the code of the function > >> "public static RelNode qualifyPartitionCol(RelNode input, List<String> > >> partitionColumns)" > >> confirms that it's not possible actually : > >> " A partition column is resolved, either (1) the same column appear in > the > >> select list of CTAS or (2) CTAS has a * in select list" > >> > >> But what is the reason of this limitation ? > >> Is there exists any tricks to do it right now, or can we expect an > >> evolution to allow this possibilities. > >> > >> I just imagine to do (with the data of the example) > >> CREATE TABLE `mytable` > >> PARTITION BY (sname) AS > >> SELECT substr(name,1,1) sname, name, birthdate, birthcity > >> ORDER BY bithdate; > >> Then, next, request each partition file to remove the useless data > >> , like > >> CREATE TABLE `mytable_2/partition_x` > >> SELECT name, birthdate, birthcity > >> ORDER BY bithdate; > >> but it's not really satisfying... > >> > >> I would appreciate yours comments, > >> Regards, > >> > >> benj > >> > > > > > -- Sincerely, Anton Gozhiy [email protected]
