So, if I understand you correctly, I would have to include the 'yr' and 'mnth' columns in addition to the 'date' column in the query?
e.g. select * from events where yr in (2016, 2017) and mnth in (11,12,1) and date between '2016-11-11' and '2017-01-23'; Is that correct? On Wed, May 31, 2017 at 4:49 PM, rahul challapalli < [email protected]> wrote: > How to partition data is dependent on how you want to access your data. If > you can foresee that most of the queries use year and month, then go-ahead > and partition the data on those 2 columns. You can do that like below > > create table partitioned_data partition by (yr, mnth) as select > extract(year from `date`) yr, extract(month from `date`) mnth, `date`, > ........ from mydata; > > For partitioning to have any benefit, your queries should have filters on > month and year columns. > > - Rahul > > On Wed, May 31, 2017 at 1:28 PM, Raz Baluchi <[email protected]> > wrote: > > > Hi all, > > > > Trying to understand parquet partitioning works. > > > > What is the recommended partitioning scheme for event data that will be > > queried primarily by date. I assume that partitioning by year and month > > would be optimal? > > > > Lets say I have data that looks like: > > > > application,status,date,message > > kafka,down,2017-03023 04:53,zookeeper is not available > > > > > > Would I have to create new columns for year and month? > > > > e.g. > > application,status,date,message,year,month > > kafka,down,2017-03023 04:53,zookeeper is not available,2017,03 > > > > and then perform a CTAS using the year and month columns as the > 'partition > > by'? > > > > Thanks > > >
