Re: Partitioning for parquet

Padma Penumarthy Wed, 31 May 2017 21:34:37 -0700

Are you running same query on both tables ? What is the filter condition ?
Since they are partitioned differently, same filter may prune the files 
differently.
If possible, can you share query profiles ?
You can check query profiles to see how many rows are being read from disk
in both cases.


Thanks,
Padma


> On May 31, 2017, at 6:15 PM, Raz Baluchi <[email protected]> wrote:
> 
> As an experiment, I created an event file will 100 million entries spanning
> 25 years. I then created tables both ways, one partitioned by year and
> month and the other by date. The first table created 410 parquet files and
> the second 11837.
> 
> Querying the first table is consistently faster by a factor of 2x to 10x,
> 
> Is this because drill is not very efficient at querying a large number of
> small(ish) parquet files?
> 
> On Wed, May 31, 2017 at 6:42 PM, rahul challapalli <
> [email protected]> wrote:
> 
>> If most of your queries use date column in the filter condition, I would
>> partition the data on the date column. Then you can simply say
>> 
>> select * from events where `date` between '2016-11-11' and '2017-01-23';
>> 
>> - Rahul
>> 
>> On Wed, May 31, 2017 at 3:22 PM, Raz Baluchi <[email protected]>
>> wrote:
>> 
>>> So, if I understand you correctly, I would have to include the 'yr' and
>>> 'mnth' columns in addition to the 'date' column in the query?
>>> 
>>> e.g.
>>> 
>>> select * from events where yr in (2016, 2017)  and mnth in (11,12,1) and
>>> date between '2016-11-11' and '2017-01-23';
>>> 
>>> Is that correct?
>>> 
>>> On Wed, May 31, 2017 at 4:49 PM, rahul challapalli <
>>> [email protected]> wrote:
>>> 
>>>> How to partition data is dependent on how you want to access your data.
>>> If
>>>> you can foresee that most of the queries use year and month, then
>>> go-ahead
>>>> and partition the data on those 2 columns. You can do that like below
>>>> 
>>>> create table partitioned_data partition by (yr, mnth) as select
>>>> extract(year from `date`) yr, extract(month from `date`) mnth, `date`,
>>>> ........ from mydata;
>>>> 
>>>> For partitioning to have any benefit, your queries should have filters
>> on
>>>> month and year columns.
>>>> 
>>>> - Rahul
>>>> 
>>>> On Wed, May 31, 2017 at 1:28 PM, Raz Baluchi <[email protected]>
>>>> wrote:
>>>> 
>>>>> Hi all,
>>>>> 
>>>>> Trying to understand parquet partitioning works.
>>>>> 
>>>>> What is the recommended partitioning scheme for event data that will
>> be
>>>>> queried primarily by date. I assume that partitioning by year and
>> month
>>>>> would be optimal?
>>>>> 
>>>>> Lets say I have data that looks like:
>>>>> 
>>>>> application,status,date,message
>>>>> kafka,down,2017-03023 04:53,zookeeper is not available
>>>>> 
>>>>> 
>>>>> Would I have to create new columns for year and month?
>>>>> 
>>>>> e.g.
>>>>> application,status,date,message,year,month
>>>>> kafka,down,2017-03023 04:53,zookeeper is not available,2017,03
>>>>> 
>>>>> and then perform a CTAS using the year and month columns as the
>>>> 'partition
>>>>> by'?
>>>>> 
>>>>> Thanks
>>>>> 
>>>> 
>>> 
>>

Re: Partitioning for parquet

Reply via email to