RE: Merge and save parquet files in Drill

Kunal Khatua Fri, 18 Aug 2017 10:53:56 -0700

If you are creating a lot of small files within each partition, that is 
because each writer fragment writes its own file for each partition.

One workaround would be to reduce the number of writer fragments. You can 
achieve that by setting the parameters below to lower values, so that each 
writer generates larger files. There might be a slight trade-off in 
performance, but it should by and large be an improvement.

planner.width.max_per_node 
  and 
planner.width.max_per_query
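
As a rough sketch of how this could look in practice: lower both options for 
the current session, then rewrite the existing table with a new CTAS so that 
fewer writers produce fewer, larger files. The values, the dfs.tmp workspace, 
and the table names below are illustrative assumptions, not recommendations; 
the partition columns are the ones mentioned in this thread.

```sql
-- Illustrative values only; suitable settings depend on cluster size and data volume.
ALTER SESSION SET `planner.width.max_per_node` = 2;
ALTER SESSION SET `planner.width.max_per_query` = 10;

-- Hypothetical workspace and table names. The partition columns are
-- back-quoted because year, month, day and hour are reserved words.
CREATE TABLE dfs.tmp.`events_merged`
PARTITION BY (`year`, `month`, `day`, `hour`)
AS SELECT * FROM dfs.tmp.`events_small_files`;
```

ALTER SESSION only affects the current connection, so other queries on the 
cluster keep their usual parallelism.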

-----Original Message-----
From: Divya Gehlot [mailto:divya.htco...@gmail.com] 
Sent: Thursday, August 17, 2017 6:57 PM
To: user@drill.apache.org
Subject: Re: Merge and save parquet files in Drill

Hi,

Is there no way to merge the files in Drill if it creates lots of small files?
AFAIK, partitioning improves performance. In my case the partitioning is 
based on year, month, day, hour, and my queries also use the partitioning 
column values in the WHERE clause.
Drill should just go and read those files, which should eventually improve 
the query performance.
In this use case it shouldn't matter whether it creates small files or big 
files, unless we query on a non-partition column.

Can somebody shed light on whether my understanding of Apache Drill is correct?

Thanks,
Divya

On 17 August 2017 at 22:55, Andries Engelbrecht <aengelbre...@mapr.com>
wrote:

> Do you partition the table?
> You may want to sort (order by) on the columns you partition, or just 
> order by in any case on the column(s) you are most likely going to use 
> for predicates. It increases the CTAS time, but normally will improve 
> the query performance quite a bit.
>
> Yes, a large number of files does affect query performance. Using 
> metadata caching helps improve the query planning time a lot.
>
> --Andries
>
>
> On 8/16/17, 11:12 PM, "Divya Gehlot" <divya.htco...@gmail.com> wrote:
>
>     Hi,
>     I have a CTAS with partitioning on 4 columns, and when I save it, it
>     creates lots of small files (~102290), where the size of each file is
>     in KBs.
>
>     My questions are:
>     1. Do lots of small files reduce the performance while reading the
>     data in Drill?
>     2. If yes, how can I merge the small Parquet files?
>
>
>
>     Thanks,
>     Divya
>
>
>
