Are you running same query on both tables ? What is the filter condition ? Since they are partitioned differently, same filter may prune the files differently. If possible, can you share query profiles ? You can check query profiles to see how many rows are being read from disk in both cases.
Thanks, Padma > On May 31, 2017, at 6:15 PM, Raz Baluchi <[email protected]> wrote: > > As an experiment, I created an event file will 100 million entries spanning > 25 years. I then created tables both ways, one partitioned by year and > month and the other by date. The first table created 410 parquet files and > the second 11837. > > Querying the first table is consistently faster by a factor of 2x to 10x, > > Is this because drill is not very efficient at querying a large number of > small(ish) parquet files? > > On Wed, May 31, 2017 at 6:42 PM, rahul challapalli < > [email protected]> wrote: > >> If most of your queries use date column in the filter condition, I would >> partition the data on the date column. Then you can simply say >> >> select * from events where `date` between '2016-11-11' and '2017-01-23'; >> >> - Rahul >> >> On Wed, May 31, 2017 at 3:22 PM, Raz Baluchi <[email protected]> >> wrote: >> >>> So, if I understand you correctly, I would have to include the 'yr' and >>> 'mnth' columns in addition to the 'date' column in the query? >>> >>> e.g. >>> >>> select * from events where yr in (2016, 2017) and mnth in (11,12,1) and >>> date between '2016-11-11' and '2017-01-23'; >>> >>> Is that correct? >>> >>> On Wed, May 31, 2017 at 4:49 PM, rahul challapalli < >>> [email protected]> wrote: >>> >>>> How to partition data is dependent on how you want to access your data. >>> If >>>> you can foresee that most of the queries use year and month, then >>> go-ahead >>>> and partition the data on those 2 columns. You can do that like below >>>> >>>> create table partitioned_data partition by (yr, mnth) as select >>>> extract(year from `date`) yr, extract(month from `date`) mnth, `date`, >>>> ........ from mydata; >>>> >>>> For partitioning to have any benefit, your queries should have filters >> on >>>> month and year columns. >>>> >>>> - Rahul >>>> >>>> On Wed, May 31, 2017 at 1:28 PM, Raz Baluchi <[email protected]> >>>> wrote: >>>> >>>>> Hi all, >>>>> >>>>> Trying to understand parquet partitioning works. >>>>> >>>>> What is the recommended partitioning scheme for event data that will >> be >>>>> queried primarily by date. I assume that partitioning by year and >> month >>>>> would be optimal? >>>>> >>>>> Lets say I have data that looks like: >>>>> >>>>> application,status,date,message >>>>> kafka,down,2017-03023 04:53,zookeeper is not available >>>>> >>>>> >>>>> Would I have to create new columns for year and month? >>>>> >>>>> e.g. >>>>> application,status,date,message,year,month >>>>> kafka,down,2017-03023 04:53,zookeeper is not available,2017,03 >>>>> >>>>> and then perform a CTAS using the year and month columns as the >>>> 'partition >>>>> by'? >>>>> >>>>> Thanks >>>>> >>>> >>> >>
