We are adding support for partitioning in CTAS, which may help in your case.
CREATE TABLE Parquet_Table (Column0, Column1, ...) PARTITION BY (Column0) FROM your_csv_file

Then the query would use partition pruning and see improved performance:

SELECT * FROM Parquet_Table WHERE Column0 = '1';

Partitioning support is currently ongoing work, targeted for the 1.1 release. You may try it on the current master branch.

On Fri, Jun 26, 2015 at 7:43 AM, Andries Engelbrecht <[email protected]> wrote:

> How big are the files, and what system are you running on?
>
> Can you provide Drill "show files" output for the directory listed?
>
> —Andries
>
> On Jun 26, 2015, at 12:47 AM, 陈礼剑 <[email protected]> wrote:
>
> > Hi:
> >
> > I have a CSV file with 20,000,000 rows and create a Parquet file for each
> > 1,000,000 rows, which means I will have 20 Parquet files in the folder
> > "/usr/download/com/togeek/data/csv/sample". Now I use Drill in embedded
> > mode to select:
> >
> > SELECT * FROM dfs.`/usr/download/com/togeek/data/csv/sample` WHERE Column0='1'
> >
> > The result should only have one row. Drill takes about 1 minute in my
> > environment, which is too slow.
> > I see the plan is:
> >
> > 00-00 Screen
> > 00-01   Project(*=[$0])
> > 00-02     UnionExchange
> > 01-01       Project(T0¦¦*=[$0])
> > 01-02         SelectionVectorRemover
> > 01-03           Filter(condition=[=($1, '1')])
> > 01-04             Project(T0¦¦*=[$0], Column0=[$1])
> > 01-05               Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=file:/usr/download/com/togeek/data/csv/sample]], selectionRoot=/usr/download/com/togeek/data/csv/sample, numFiles=1, columns=[`*`]]])
> >
> > I think it means:
> > first: iterate over all fields from all files
> > second: filter
> > .....
> >
> > Is that right?
> > ---------------------------------------------------------------------------------------
> > Why not:
> > 1: iterate over the ID field from all files
> > 2: filter, so we know which files are hit, and which rows are hit in each file
> > 3: iterate over all fields from the hit files
> > ....
> > Even when querying a single Parquet file, we could apply this rule, so
> > only a few row groups in the Parquet file would be scanned. Then columnar
> > storage can achieve its best performance.
> >
> > Does this solution sound correct? If yes, why do we not use it?
> > If I want to make such a SQL query run quickly, do you have any
> > suggestions? The more detail, the better.
> >
> > Thanks.
> >
> > --------Davy Chen
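[For readers following this thread: the CTAS partitioning workflow suggested above can be sketched end-to-end as below. This is a hedged sketch, not confirmed syntax from the thread: it assumes the PARTITION BY clause of CTAS targeted for Drill 1.1, uses Drill's columns[] array for the headerless CSV input, and the workspace (dfs.tmp) and column names are hypothetical.]

```sql
-- Sketch: build a partitioned Parquet table from the CSV source.
-- PARTITION BY columns must appear in the CTAS SELECT list.
CREATE TABLE dfs.tmp.`parquet_table`
PARTITION BY (Column0)
AS SELECT
  columns[0] AS Column0,   -- CSV fields arrive as the columns[] array
  columns[1] AS Column1    -- remaining fields would be aliased the same way
FROM dfs.`/usr/download/com/togeek/data/csv/sample`;

-- With the data partitioned on Column0, this filter should prune to the
-- matching partition files instead of scanning all 20 Parquet files.
SELECT * FROM dfs.tmp.`parquet_table` WHERE Column0 = '1';
```

The idea is that pruning happens at planning time: the Filter on the partition column lets the ParquetGroupScan drop non-matching files before any row groups are read.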
