Hi, I also face a similar issue to Jinfeng's, when querying data on the columns year, month and day, which were the partitioning columns too. It created lots of small files, and querying took almost 20x longer than reading non-partitioned data. Another issue I faced: when I queried the data filtering just on the partitioned columns, it read only the concerned files (just 4 files), but after I executed a metadata cache refresh it read all the files.
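For reference, the commands involved looked roughly like this (the table path and filter values are placeholders, not my actual data):

-- Filtering only on the partitioning columns read just the concerned files:
select count(*) from dfs.`/data/events`
where `year` = 2017 and `month` = 8 and `day` = 31;

-- After rebuilding the metadata cache, the same query read all the files:
refresh table metadata dfs.`/data/events`;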
Thanks,
Divya

On 2 September 2017 at 01:28, Jinfeng Ni <j...@apache.org> wrote:

> If you have small cardinality for the partitioning column, yet still end
> up with 50k different small files, it's possible that you have many
> parallel writer minor-fragments (threads). By default, each writer
> minor-fragment works independently. If you have cardinality C and N
> writer minor-fragments, you could end up with up to C*N small files.
>
> There are two possible solutions.
>
> 1) You may consider setting the following option to true. This adds
> network communication/CPU cost, yet it reduces the # of files to C.
>
> alter session set `store.partition.hash_distribute` = true; // default is
> false.
>
> 2) Reduce the number of parallel writer minor-fragments by tuning other
> parameters before you run the CTAS partition statement.
>
> For partition pruning, Drill works at the row group level, not at the
> page level.
>
>
> On Fri, Sep 1, 2017 at 9:02 AM, Padma Penumarthy <ppenumar...@mapr.com>
> wrote:
>
> > Have you tried building the metadata cache file using the "refresh
> > table metadata" command? That will help reduce the planning time. Is
> > most of the time spent in planning or in execution?
> >
> > Pruning is done at the row group level, i.e. at the file level (we
> > create one file per row group). We do not support pruning at the page
> > level. If it created 50K files, your cardinality must be high. You
> > might want to consider putting a directory hierarchy in place: for
> > example, you can create a directory for each unique value of column 1
> > and a file for each unique value of column 2 underneath. If
> > partitioning is done correctly then, depending on the filters, we
> > should not read more row groups than needed.
> >
> > Thanks,
> > Padma
> >
> >
> > On Sep 1, 2017, at 6:54 AM, Damien Profeta <damien.prof...@amadeus.com>
> > wrote:
> >
> > Hello,
> >
> > I have a dataset that I always query on 2 columns that don't have high
> > cardinality. To benefit from pruning, I tried to partition the files on
> > these keys, but I ended up with 50k different small files (30 MB each),
> > and queries on them spend most of their time in the planning phase,
> > decoding the metadata file, resolving the absolute paths…
> >
> > Looking at the Parquet file structure, I saw that there are statistics
> > at the page level and the chunk level. So I tried to generate Parquet
> > files where a page is dedicated to one value of the 2 partition
> > columns. By using the statistics, Drill could then drop the page/chunk.
> > But it seems Drill does not make any use of these statistics in the
> > Parquet file, because whatever query I run, I see no change in the
> > number of pages loaded.
> >
> > Do you confirm my conclusion? What would be the best way to organize
> > the data so that Drill doesn't read the data that can be pruned easily?
> >
> > Thanks
> > Damien
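P.S. For the archive, here is roughly how I read Jinfeng's two options as a session setup before the CTAS. The table and column names are placeholders; he doesn't name the parallelism parameter in option 2, and `planner.width.max_per_node` is only my guess at one knob that limits it:

-- Option 1: hash-distribute rows on the partition key before the writers,
-- so each distinct key value lands in one file instead of one per writer:
alter session set `store.partition.hash_distribute` = true;

-- Option 2 (alternative): cap per-node parallelism, and with it the number
-- of writer minor-fragments (assumed knob, see note above):
alter session set `planner.width.max_per_node` = 4;

-- Then run the partitioned CTAS:
create table dfs.tmp.`events_by_day`
partition by (`year`, `month`, `day`)
as select * from dfs.`/data/raw/events`;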
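And if I understand Padma's suggestion, the directory hierarchy maps onto Drill's dir0/dir1 pseudo-columns, something like this (paths and values invented for illustration):

-- Layout: one directory per value of column 1, one file per value of
-- column 2 underneath, e.g.:
--   /data/events/2017/au.parquet
--   /data/events/2017/fr.parquet
--   /data/events/2018/fr.parquet
-- Drill exposes each directory level as dir0, dir1, ... and can prune
-- whole directories on a filter like:
select count(*) from dfs.`/data/events` where dir0 = '2017';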