As an experiment, I created an event file with 100 million entries spanning 25 years. I then created tables both ways, one partitioned by year and month and the other by date. The first table created 410 parquet files and the second 11837.
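For reference, the two CTAS statements looked roughly like this (table names and the `events_source` input are placeholders; the column list follows the sample schema from the earlier messages):

```sql
-- Table 1: partitioned by year and month (410 parquet files)
create table events_by_month partition by (yr, mnth) as
select extract(year from `date`) yr, extract(month from `date`) mnth,
       application, status, `date`, message
from events_source;

-- Table 2: partitioned by date (11837 parquet files)
create table events_by_date partition by (`date`) as
select application, status, `date`, message
from events_source;
```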
Querying the first table is consistently faster, by a factor of 2x to 10x. Is this because Drill is not very efficient at querying a large number of small(ish) parquet files?

On Wed, May 31, 2017 at 6:42 PM, rahul challapalli <[email protected]> wrote:

> If most of your queries use the date column in the filter condition, I would
> partition the data on the date column. Then you can simply say
>
> select * from events where `date` between '2016-11-11' and '2017-01-23';
>
> - Rahul
>
> On Wed, May 31, 2017 at 3:22 PM, Raz Baluchi <[email protected]> wrote:
>
> > So, if I understand you correctly, I would have to include the 'yr' and
> > 'mnth' columns in addition to the 'date' column in the query?
> >
> > e.g.
> >
> > select * from events where yr in (2016, 2017) and mnth in (11,12,1) and
> > date between '2016-11-11' and '2017-01-23';
> >
> > Is that correct?
> >
> > On Wed, May 31, 2017 at 4:49 PM, rahul challapalli <[email protected]> wrote:
> >
> > > How to partition data is dependent on how you want to access your data.
> > > If you can foresee that most of the queries use year and month, then
> > > go ahead and partition the data on those 2 columns. You can do that
> > > like below:
> > >
> > > create table partitioned_data partition by (yr, mnth) as select
> > > extract(year from `date`) yr, extract(month from `date`) mnth, `date`,
> > > ........ from mydata;
> > >
> > > For partitioning to have any benefit, your queries should have filters
> > > on month and year columns.
> > >
> > > - Rahul
> > >
> > > On Wed, May 31, 2017 at 1:28 PM, Raz Baluchi <[email protected]> wrote:
> > >
> > > > Hi all,
> > > >
> > > > Trying to understand how parquet partitioning works.
> > > >
> > > > What is the recommended partitioning scheme for event data that will
> > > > be queried primarily by date? I assume that partitioning by year and
> > > > month would be optimal?
> > > >
> > > > Let's say I have data that looks like:
> > > >
> > > > application,status,date,message
> > > > kafka,down,2017-03-23 04:53,zookeeper is not available
> > > >
> > > > Would I have to create new columns for year and month?
> > > >
> > > > e.g.
> > > >
> > > > application,status,date,message,year,month
> > > > kafka,down,2017-03-23 04:53,zookeeper is not available,2017,03
> > > >
> > > > and then perform a CTAS using the year and month columns as the
> > > > 'partition by'?
> > > >
> > > > Thanks
