Re: drill parquet - create table as ... partition by ... non present column

Anton Gozhiy Mon, 10 Dec 2018 06:40:55 -0800

Benj,

I meant that without the metadata Drill won't recognize files as
partitions. Although that should not be a problem for the optimization
mechanisms.


As for your original question, I think that it's rather not implemented
than an intended limitation. Feel free to submit a feature request to
Apache JIRA.


On Fri, Dec 7, 2018 at 9:20 PM benj.dev <[email protected]> wrote:

> Hi,
>
> Thanks for details.
> It's the point, I don't want to write additional metadata,
> but just organize the parquet file to have more useful stats.
>
> In a simple GROUP BY it's possible to not SELECT some of "grouped" column.
> (Example SELECT a, b FROM ... GROUP BY a, b, c;)
> In the same way, I think it will be useful to have the possibilities
> to PARTITION BY and say (after the PARTITION BY) which columns put in
> the parquet file.
>
> I can imagine, that it will be difficult to add a "SELECT" close in
> PARTITION BY clauses like :
> [... PARTITION BY (a (SELECT ONLY b, c)) AS SELECT a, b, c FROM ...]
>
> Intermediately, allow the possibilities to "PARTITION BY" a computation
> on an existing column seem more accessible and should be useful.
> example :
> [... PARTITION BY (a < 10) AS SELECT a, b, c FROM ...]
>
> Because it's useless to store the "a < 10" but it may be useful to filer
> half of the parquet when request with condition on column a.
>
> But maybe there is a fundamental reason to not allow these facilities.
>
> Regards,
> Benj
>
> Le 06/12/2018 à 16:42, Anton Gozhiy a écrit :
> > Hi Benj,
> >
> > Creating partitions as in your first example won't work.
> >>From the docs: "During partitioning, Drill creates separate files, but
> not
> > separate directories, for different partitions." (
> > https://drill.apache.org/docs/how-to-partition-data/).
> > Also, Drill doesn't write additional metadata regarding partitioning,
> when
> > it reads parquet files it determines partitions using min/max values.
> > That means that if you want for example to partition using the first
> > letter, you'll need to create a corresponding column. Or you can create
> > partitions manually as directories.
> >
> > On Wed, Dec 5, 2018 at 10:07 PM <[email protected]> wrote:
> >
> >> In would like to create a parquet with a partition on computed data
> >> (without to have to put the result of the computation in the parquet) :
> >> The goal is to optimize the parquet for typical expecting queries.
> >>
> >> Imaginary example :
> >> CREATE TABLE `mytable`
> >> PARTITION BY (substr(name,1,1)) AS
> >> SELECT name, birthdate, birthcity
> >> ORDER BY bithdate;
> >>
> >> So, if I do that I obtain a VALIDATION ERROR: Partition column ... is
> not
> >> in the SELECT list of CTAS
> >>
> >> And the comment of the code of the function
> >> "public static RelNode qualifyPartitionCol(RelNode input, List<String>
> >> partitionColumns)"
> >> confirms that it's not possible actually :
> >> " A partition column is resolved, either (1) the same column appear in
> the
> >> select list of CTAS or (2) CTAS has a * in select list"
> >>
> >> But what is the reason of this limitation ?
> >> Is there exists any tricks to do it right now, or can we expect an
> >> evolution to allow this possibilities.
> >>
> >> I just imagine to do (with the data of the example)
> >> CREATE TABLE `mytable`
> >> PARTITION BY (sname) AS
> >> SELECT substr(name,1,1) sname, name, birthdate, birthcity
> >> ORDER BY bithdate;
> >> Then, next, request each partition file to remove the useless data
> >> , like
> >> CREATE TABLE `mytable_2/partition_x`
> >> SELECT name, birthdate, birthcity
> >> ORDER BY bithdate;
> >> but it's not really satisfying...
> >>
> >> I would appreciate yours comments,
> >> Regards,
> >>
> >> benj
> >>
> >
> >
>


-- 
Sincerely, Anton Gozhiy
[email protected]

Re: drill parquet - create table as ... partition by ... non present column

Reply via email to