Re: Performance querying a single column out of a parquet file

Johannes Zillmann Mon, 11 Apr 2016 08:09:30 -0700

Hey Ted,

Sorry, I mixed up row and column!


The queries look like this:
        (1) "SELECT * FROM dfs.`myParquetFile` WHERE `id` = 23"
        (2) "SELECT id FROM dfs.`myParquetFile` WHERE `id` = 23"

(1) takes about 14 sec and (2) takes about 1.5 sec.
I'm using Drill 1.6.
So it looks like Drill is extracting the columns before filtering, which I
didn’t expect…
Is there any way to change that behaviour?
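
To make the two orderings concrete, here is a minimal sketch of "extract columns, then filter" versus "filter, then extract columns only for hits". It is a toy in-memory column store, not Drill's actual Parquet reader; the names make_table, select_early and select_late are made up for illustration, and "decoded" simply counts how many cell values get touched.

```python
# Toy column store: every column is stored as its own list, as in a columnar file.
# All names are hypothetical; this does not model Drill's real scan operator.

def make_table(num_rows, num_cols):
    """Build a table with an integer `id` column plus num_cols - 1 string columns."""
    table = {"id": list(range(num_rows))}
    for c in range(1, num_cols):
        table[f"col{c}"] = [f"v{c}_{r}" for r in range(num_rows)]
    return table


def select_early(table, wanted, key, value):
    """Decode all requested columns for every row, then apply the filter."""
    cols = list(dict.fromkeys([key, *wanted]))  # filter column plus projection
    decoded = 0
    hits = []
    for r in range(len(table[key])):
        row = {}
        for name in cols:
            row[name] = table[name][r]
            decoded += 1
        if row[key] == value:
            hits.append({name: row[name] for name in wanted})
    return hits, decoded


def select_late(table, wanted, key, value):
    """Scan only the filter column, then decode the other columns for hits."""
    decoded = 0
    hit_rows = []
    for r, v in enumerate(table[key]):
        decoded += 1
        if v == value:
            hit_rows.append(r)
    hits = []
    for r in hit_rows:
        row = {}
        for name in wanted:
            if name != key:
                decoded += 1  # the key value was already decoded during the scan
            row[name] = table[name][r]
        hits.append(row)
    return hits, decoded


if __name__ == "__main__":
    table = make_table(1000, 9)
    all_cols = list(table)
    print(select_early(table, all_cols, "id", 23)[1])  # SELECT *, early extraction
    print(select_late(table, all_cols, "id", 23)[1])   # SELECT *, late extraction
    print(select_early(table, ["id"], "id", 23)[1])    # SELECT id
```

With 1000 rows and 9 columns, the early variant touches 9000 values for the "select *" case, while the late variant touches 1008 (1000 for the id scan plus 8 for the single hit). Selecting only id touches 1000 values either way. The roughly 9x to 10x gap I measured would fit the early pattern, assuming all columns cost about the same to decode, which may not hold for real data.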

Johannes



> On 11 Apr 2016, at 16:42, Ted Dunning <[email protected]> wrote:
> 
> Did you mean that you are doing a select to find a single column? What you
> typed was row, but that seems out of line with the rest of what you wrote.
> 
> If you are truly asking about filtering down to a single row, whether it
> costs more to return all of the columns rather than just one from a single
> row will depend on whether Drill is extracting columns before filtering or
> after.
> 
> 
> 
> On Mon, Apr 11, 2016 at 6:41 AM, Johannes Zillmann <[email protected]>
> wrote:
> 
>> Hey there,
>> 
>> I'm currently doing some performance measurements on Drill.
>> In my case it's a single Parquet file with a single local Drillbit.
>> 
>> Now in one case I have unexpected results and I’m curious whether somebody has
>> a good explanation for them!
>> 
>> So I have a file with 10 million rows and 9 columns.
>> Now I’m doing a select statement to find one single row.
>> Runtime with select * : ~ 14.79 sec
>> Runtime with select(filterField) : ~ 1.5 sec
>> 
>> So I’m surprised that there is so much variance depending on the fields I
>> select, since I thought Drill needs most of its time for finding that one
>> element, and then deserialises the other fields only on a hit…
>> But 10 sec for deserialising 8 more fields seems way too much!?
>> 
>> best
>> Johannes
>> 
>>
