Re: Partitioning for parquet

Raz Baluchi Wed, 31 May 2017 15:22:53 -0700

So, if I understand you correctly, I would have to include the 'yr' and
'mnth' columns in addition to the 'date' column in the query?


e.g.

select * from events where yr in (2016, 2017) and mnth in (11, 12, 1) and
`date` between '2016-11-11' and '2017-01-23';

Is that correct?
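
One thing worth noting about the filter above: the two IN lists are independent, so they also match (2016, 1), (2017, 11) and (2017, 12). The `date` predicate keeps the result correct, but more partitions than necessary are named. Below is a minimal Python sketch, not anything from Drill itself, that derives the exact (yr, mnth) pairs for a date range and builds a tighter filter. The helper names `month_partitions` and `partition_filter` are hypothetical, and the column names yr, mnth and `date` are taken from this thread.

```python
from datetime import date


def month_partitions(start: date, end: date) -> list[tuple[int, int]]:
    """Return every (yr, mnth) pair touched by the inclusive range start..end."""
    if start > end:
        raise ValueError("start must not be after end")
    pairs = []
    yr, mnth = start.year, start.month
    while (yr, mnth) <= (end.year, end.month):
        pairs.append((yr, mnth))
        # Roll over to January of the next year after December.
        yr, mnth = (yr + 1, 1) if mnth == 12 else (yr, mnth + 1)
    return pairs


def partition_filter(start: date, end: date) -> str:
    """Build a WHERE-clause fragment over the yr, mnth and `date` columns.

    Assumes the table was partitioned on columns named yr and mnth,
    as in the CTAS suggested in this thread.
    """
    by_year: dict[int, list[int]] = {}
    for yr, mnth in month_partitions(start, end):
        by_year.setdefault(yr, []).append(mnth)
    parts = [
        f"(yr = {yr} and mnth in ({', '.join(map(str, months))}))"
        for yr, months in by_year.items()
    ]
    return (
        f"({' or '.join(parts)}) and "
        f"`date` between '{start.isoformat()}' and '{end.isoformat()}'"
    )


print(partition_filter(date(2016, 11, 11), date(2017, 1, 23)))
```

Whether the planner actually prunes on this form is best confirmed by inspecting the query plan rather than assumed.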

On Wed, May 31, 2017 at 4:49 PM, rahul challapalli <
[email protected]> wrote:

> How you partition the data depends on how you want to access it. If
> you can foresee that most queries will filter on year and month, then go ahead
> and partition the data on those two columns. You can do that as shown below:
>
> create table partitioned_data partition by (yr, mnth) as select
> extract(year from `date`) yr, extract(month from `date`) mnth, `date`,
> ........ from mydata;
>
> For partitioning to have any benefit, your queries should filter on
> the yr and mnth partition columns.
>
> - Rahul
>
> On Wed, May 31, 2017 at 1:28 PM, Raz Baluchi <[email protected]>
> wrote:
>
> > Hi all,
> >
> > Trying to understand how parquet partitioning works.
> >
> > What is the recommended partitioning scheme for event data that will be
> > queried primarily by date? I assume that partitioning by year and month
> > would be optimal?
> >
> > Let's say I have data that looks like:
> >
> > application,status,date,message
> > kafka,down,2017-03-23 04:53,zookeeper is not available
> >
> >
> > Would I have to create new columns for year and month?
> >
> > e.g.
> > application,status,date,message,year,month
> > kafka,down,2017-03-23 04:53,zookeeper is not available,2017,03
> >
> > and then perform a CTAS using the year and month columns as the
> > 'partition by'?
> >
> > Thanks
> >
>
