How big are the files and what system are you running on?

Can you provide the output of a Drill SHOW FILES for the directory listed?
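
For reference, a query along these lines should do it (adjust the workspace 
name if yours differs):

  SHOW FILES IN dfs.`/usr/download/com/togeek/data/csv/sample`;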


—Andries



On Jun 26, 2015, at 12:47 AM, 陈礼剑 <[email protected]> wrote:

> Hi:
> 
> 
> I have a CSV file with 20,000,000 rows, and I create one Parquet file for each 
> 1,000,000 rows, which means I have 20 Parquet files in the folder 
> "/usr/download/com/togeek/data/csv/sample". Now I use Drill in embedded mode 
> to run:
>   SELECT * FROM dfs.`/usr/download/com/togeek/data/csv/sample` WHERE 
> Column0='1'
> 
> 
> The result should contain only one row, but Drill takes about 1 minute in my 
> environment, which is too slow.
> The plan I see is:
> 00-00    Screen
> 00-01      Project(*=[$0])
> 00-02        UnionExchange
> 01-01          Project(T0¦¦*=[$0])
> 01-02            SelectionVectorRemover
> 01-03              Filter(condition=[=($1, '1')])
> 01-04                Project(T0¦¦*=[$0], Column0=[$1])
> 01-05                  Scan(groupscan=[ParquetGroupScan 
> [entries=[ReadEntryWithPath 
> [path=file:/usr/download/com/togeek/data/csv/sample]], 
> selectionRoot=/usr/download/com/togeek/data/csv/sample, numFiles=1, 
> columns=[`*`]]])
> 
> 
> 
> I think this means:
> first: scan all fields from all files
> second: filter
> .....
> 
> 
> Is that right?
> ---------------------------------------------------------------------------------------
> Why not:
> 1: scan only the ID field from all files
> 2: filter, so we know which files are hit, and which rows are hit within 
> each file
> 3: scan all fields, but only from the files that were hit
> ....
> Even when querying a single Parquet file we could apply this rule, so that 
> only a few row groups in the file would be scanned. Then columnar storage 
> could deliver its best performance.
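> 
> 
> Roughly, as a manual two-pass sketch in SQL (the implicit filename column 
> and the <hit-file>.parquet path are only assumptions here, to illustrate 
> the idea):
> 
>   -- pass 1: scan only the filter column to learn which file(s) contain a hit
>   SELECT filename FROM dfs.`/usr/download/com/togeek/data/csv/sample` 
>   WHERE Column0='1';
>   -- pass 2: scan all fields, but only from the file(s) found in pass 1 
>   -- (<hit-file> is a placeholder for whatever pass 1 returns)
>   SELECT * FROM dfs.`/usr/download/com/togeek/data/csv/sample/<hit-file>.parquet` 
>   WHERE Column0='1';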
> 
> 
> Does this solution sound correct? If yes, why don't we use it?
> If I want to make such a SQL query run quickly, do you have any suggestions? 
> The more detail the better.
> 
> Thanks.
> 
> 
> --------Davy Chen
> 
> 
> 
