If most of your queries use the `date` column in the filter condition, I would
partition the data on the date column. Then you can simply say:

select * from events where `date` between '2016-11-11' and '2017-01-23';

- Rahul

On Wed, May 31, 2017 at 3:22 PM, Raz Baluchi <[email protected]> wrote:

> So, if I understand you correctly, I would have to include the 'yr' and
> 'mnth' columns in addition to the 'date' column in the query?
>
> e.g.
>
> select * from events where yr in (2016, 2017) and mnth in (11, 12, 1) and
> date between '2016-11-11' and '2017-01-23';
>
> Is that correct?
>
> On Wed, May 31, 2017 at 4:49 PM, rahul challapalli <
> [email protected]> wrote:
>
> > How to partition data depends on how you want to access your data. If
> > you can foresee that most of the queries use year and month, then go
> > ahead and partition the data on those two columns. You can do that like
> > below:
> >
> > create table partitioned_data partition by (yr, mnth) as select
> > extract(year from `date`) yr, extract(month from `date`) mnth, `date`,
> > ........ from mydata;
> >
> > For partitioning to have any benefit, your queries should have filters
> > on the month and year columns.
> >
> > - Rahul
> >
> > On Wed, May 31, 2017 at 1:28 PM, Raz Baluchi <[email protected]>
> > wrote:
> >
> > > Hi all,
> > >
> > > Trying to understand how parquet partitioning works.
> > >
> > > What is the recommended partitioning scheme for event data that will
> > > be queried primarily by date? I assume that partitioning by year and
> > > month would be optimal?
> > >
> > > Let's say I have data that looks like:
> > >
> > > application,status,date,message
> > > kafka,down,2017-03-23 04:53,zookeeper is not available
> > >
> > > Would I have to create new columns for year and month?
> > >
> > > e.g.
> > >
> > > application,status,date,message,year,month
> > > kafka,down,2017-03-23 04:53,zookeeper is not available,2017,03
> > >
> > > and then perform a CTAS using the year and month columns as the
> > > 'partition by'?
> > >
> > > Thanks
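Spelling out Rahul's first suggestion as a Drill CTAS, a partition-by-date version might look like the sketch below. The storage plugin, workspace, and table names (`dfs.tmp.events_by_date`, `dfs.tmp.events_raw`) are illustrative, not from the thread:

```sql
-- Hedged sketch: partition the events table directly on the date column,
-- so a filter such as `date` BETWEEN ... can prune whole partitions.
-- In Drill, the PARTITION BY column must also appear in the SELECT list.
CREATE TABLE dfs.tmp.events_by_date
PARTITION BY (`date`)
AS SELECT application, status, `date`, message
FROM dfs.tmp.events_raw;

-- Queries filtering on the partition column can then skip
-- non-matching partitions:
SELECT *
FROM dfs.tmp.events_by_date
WHERE `date` BETWEEN '2016-11-11' AND '2017-01-23';
```

One caveat worth noting: if `date` is actually a timestamp (the sample rows carry times like `04:53`), partitioning on the raw value would create a partition per distinct timestamp. Casting to DATE first, or deriving `yr`/`mnth` columns as in the quoted CTAS, keeps the partition count manageable.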
