Hi, I also face a similar issue to Jinfeng's, when querying data on the columns year, month and day, which were the partitioning columns too. It created lots of small files, and querying took almost 20x longer than reading non-partitioned data. Another issue I faced: when I queried the data filtering just on the partitioned columns, it read only the concerned files (just 4 files), but after I executed a metadata cache refresh it read all the files.
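For reference, the commands involved looked roughly like this (the table path and filter values are placeholders, not my actual data):

-- Filtering only on the partitioning columns read just the concerned files:
select count(*) from dfs.`/data/events`
where `year` = 2017 and `month` = 8 and `day` = 31;

-- After rebuilding the metadata cache, the same query read all the files:
refresh table metadata dfs.`/data/events`;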
Thanks,
Divya

On 2 September 2017 at 01:28, Jinfeng Ni <j...@apache.org> wrote:

> If you have small cardinality for the partitioning column, yet still end
> up with 50k different small files, it's possible that you have many
> parallel writer minor-fragments (threads). By default, each writer
> minor-fragment works independently. If you have cardinality C and N
> writer minor-fragments, you could end up with up to C*N small files.
>
> There are two possible solutions.
>
> 1) You may consider setting the following option to true. This adds
> network communication/CPU cost, yet it reduces the # of files to C.
>
> alter session set `store.partition.hash_distribute` = true; // default is
> false.
>
> 2) Reduce the number of parallel writer minor-fragments by tuning other
> parameters before you run the CTAS partition statement.
>
> For partition pruning, Drill works at the row group level, not at the
> page level.
>
>
> On Fri, Sep 1, 2017 at 9:02 AM, Padma Penumarthy <ppenumar...@mapr.com>
> wrote:
>
> > Have you tried building the metadata cache file using the "refresh
> > table metadata" command? That will help reduce the planning time. Is
> > most of the time spent in planning or in execution?
> >
> > Pruning is done at the row group level, i.e. at the file level (we
> > create one file per row group). We do not support pruning at the page
> > level. If it created 50K files, your cardinality must be high. You
> > might want to consider putting a directory hierarchy in place: for
> > example, you can create a directory for each unique value of column 1
> > and a file for each unique value of column 2 underneath. If
> > partitioning is done correctly then, depending on the filters, we
> > should not read more row groups than needed.
> >
> > Thanks,
> > Padma
> >
> >
> > On Sep 1, 2017, at 6:54 AM, Damien Profeta <damien.prof...@amadeus.com>
> > wrote:
> >
> > Hello,
> >
> > I have a dataset that I always query on 2 columns that don't have high
> > cardinality. To benefit from pruning, I tried to partition the files on
> > these keys, but I ended up with 50k different small files (30 MB each),
> > and queries on them spend most of their time in the planning phase,
> > decoding the metadata file, resolving the absolute paths…
> >
> > Looking at the Parquet file structure, I saw that there are statistics
> > at the page level and the chunk level. So I tried to generate Parquet
> > files where a page is dedicated to one value of the 2 partition
> > columns. By using the statistics, Drill could then drop the page/chunk.
> > But it seems Drill does not make any use of these statistics in the
> > Parquet file, because whatever query I run, I see no change in the
> > number of pages loaded.
> >
> > Do you confirm my conclusion? What would be the best way to organize
> > the data so that Drill doesn't read the data that can be pruned easily?
> >
> > Thanks
> > Damien
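P.S. For the archive, here is roughly how I read Jinfeng's two options as a session setup before the CTAS. The table and column names are placeholders; he doesn't name the parallelism parameter in option 2, and `planner.width.max_per_node` is only my guess at one knob that limits it:

-- Option 1: hash-distribute rows on the partition key before the writers,
-- so each distinct key value lands in one file instead of one per writer:
alter session set `store.partition.hash_distribute` = true;

-- Option 2 (alternative): cap per-node parallelism, and with it the number
-- of writer minor-fragments (assumed knob, see note above):
alter session set `planner.width.max_per_node` = 4;

-- Then run the partitioned CTAS:
create table dfs.tmp.`events_by_day`
partition by (`year`, `month`, `day`)
as select * from dfs.`/data/raw/events`;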
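And if I understand Padma's suggestion, the directory hierarchy maps onto Drill's dir0/dir1 pseudo-columns, something like this (paths and values invented for illustration):

-- Layout: one directory per value of column 1, one file per value of
-- column 2 underneath, e.g.:
--   /data/events/2017/au.parquet
--   /data/events/2017/fr.parquet
--   /data/events/2018/fr.parquet
-- Drill exposes each directory level as dir0, dir1, ... and can prune
-- whole directories on a filter like:
select count(*) from dfs.`/data/events` where dir0 = '2017';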