If most of your queries use the `date` column in the filter condition, I would
partition the data on the date column. Then you can simply say:

select * from events where `date` between '2016-11-11' and '2017-01-23';

- Rahul

On Wed, May 31, 2017 at 3:22 PM, Raz Baluchi <[email protected]> wrote:

> So, if I understand you correctly, I would have to include the 'yr' and
> 'mnth' columns in addition to the 'date' column in the query?
>
> e.g.
>
> select * from events where yr in (2016, 2017) and mnth in (11, 12, 1) and
> date between '2016-11-11' and '2017-01-23';
>
> Is that correct?
>
> On Wed, May 31, 2017 at 4:49 PM, rahul challapalli <
> [email protected]> wrote:
>
> > How to partition data depends on how you want to access your data. If
> > you can foresee that most of the queries use year and month, then go
> > ahead and partition the data on those two columns. You can do that like
> > below:
> >
> > create table partitioned_data partition by (yr, mnth) as select
> > extract(year from `date`) yr, extract(month from `date`) mnth, `date`,
> > ........ from mydata;
> >
> > For partitioning to have any benefit, your queries should have filters
> > on the month and year columns.
> >
> > - Rahul
> >
> > On Wed, May 31, 2017 at 1:28 PM, Raz Baluchi <[email protected]>
> > wrote:
> >
> > > Hi all,
> > >
> > > Trying to understand how parquet partitioning works.
> > >
> > > What is the recommended partitioning scheme for event data that will
> > > be queried primarily by date? I assume that partitioning by year and
> > > month would be optimal?
> > >
> > > Let's say I have data that looks like:
> > >
> > > application,status,date,message
> > > kafka,down,2017-03-23 04:53,zookeeper is not available
> > >
> > > Would I have to create new columns for year and month?
> > >
> > > e.g.
> > >
> > > application,status,date,message,year,month
> > > kafka,down,2017-03-23 04:53,zookeeper is not available,2017,03
> > >
> > > and then perform a CTAS using the year and month columns as the
> > > 'partition by'?
> > >
> > > Thanks
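Spelling out Rahul's first suggestion as a Drill CTAS, a partition-by-date version might look like the sketch below. The storage plugin, workspace, and table names (`dfs.tmp.events_by_date`, `dfs.tmp.events_raw`) are illustrative, not from the thread:

```sql
-- Hedged sketch: partition the events table directly on the date column,
-- so a filter such as `date` BETWEEN ... can prune whole partitions.
-- In Drill, the PARTITION BY column must also appear in the SELECT list.
CREATE TABLE dfs.tmp.events_by_date
PARTITION BY (`date`)
AS SELECT application, status, `date`, message
FROM dfs.tmp.events_raw;

-- Queries filtering on the partition column can then skip
-- non-matching partitions:
SELECT *
FROM dfs.tmp.events_by_date
WHERE `date` BETWEEN '2016-11-11' AND '2017-01-23';
```

One caveat worth noting: if `date` is actually a timestamp (the sample rows carry times like `04:53`), partitioning on the raw value would create a partition per distinct timestamp. Casting to DATE first, or deriving `yr`/`mnth` columns as in the quoted CTAS, keeps the partition count manageable.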
