How you partition the data depends on how you want to access it. If you can foresee that most of your queries will filter on year and month, then go ahead and partition the data on those two columns. You can do that with a CTAS like the one below:
create table partitioned_data
partition by (yr, mnth)
as select extract(year from `date`) yr, extract(month from `date`) mnth, `date`, ........
from mydata;

For partitioning to have any benefit, your queries should have filters on the year and month columns.

- Rahul

On Wed, May 31, 2017 at 1:28 PM, Raz Baluchi <[email protected]> wrote:

> Hi all,
>
> Trying to understand how parquet partitioning works.
>
> What is the recommended partitioning scheme for event data that will be
> queried primarily by date? I assume that partitioning by year and month
> would be optimal?
>
> Let's say I have data that looks like:
>
> application,status,date,message
> kafka,down,2017-03-23 04:53,zookeeper is not available
>
> Would I have to create new columns for year and month?
>
> e.g.
> application,status,date,message,year,month
> kafka,down,2017-03-23 04:53,zookeeper is not available,2017,03
>
> and then perform a CTAS using the year and month columns as the 'partition
> by'?
>
> Thanks
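
P.S. To illustrate the point about filters, here is a rough sketch of a query that should be able to take advantage of the partitioning. It assumes the table name partitioned_data and the yr/mnth columns from the CTAS above, and that the application, status and message columns from the sample data were carried over in the select list (the CTAS above elides them, so that part is an assumption).

```sql
-- Filters on both partition columns (yr, mnth), so only the files
-- for March 2017 should need to be read.
-- Column names other than yr, mnth and `date` are assumed from the sample data.
select application, status, `date`, message
from partitioned_data
where yr = 2017
  and mnth = 3;
```

A query that filters only on `date` itself, without yr and mnth in the where clause, would not necessarily get the same pruning, which is why the filters need to be on the partition columns.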
