Re: Performance querying a single column out of a parquet file

Johannes Zillmann Thu, 14 Apr 2016 05:43:17 -0700

Ok, thanks for the information!

Am I right that once DRILL-1950 is fixed, Drill would automatically 
materialize only those rows/columns which match the filter?


If not, would the late materialization you described for the filter case be 
possible to implement with the current hooks/API?
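
To make sure I understand the idea, here is a minimal sketch of late
materialization as I read it from the description below. It is plain Python
over in-memory lists, not Drill code: the function name, the table layout and
the sample values are all made up for illustration, and it assumes the storage
format allows lookup by row id.

```python
# Illustrative only: a columnar table held as plain Python lists.
# None of these names are Drill APIs; they are hypothetical.

def late_materialize(columns, filter_column, predicate):
    """Scan only filter_column, then fetch the other columns for matching row ids."""
    # Step 1: read just the filter column and collect the matching row ids.
    row_ids = [i for i, value in enumerate(columns[filter_column]) if predicate(value)]
    # Step 2: look up every column by row id, only for the rows that matched.
    return [{name: values[i] for name, values in columns.items()} for i in row_ids]

# Made-up sample data standing in for the parquet file.
table = {
    "id": [21, 22, 23, 24],
    "name": ["a", "b", "c", "d"],
    "price": [1.0, 2.0, 3.0, 4.0],
}

# Equivalent in spirit to: SELECT * FROM table WHERE id = 23
rows = late_materialize(table, "id", lambda v: v == 23)
print(rows)
```

The point is that the non-filter columns are touched only for the row ids
that survive the filter, instead of being read for every row up front.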

Johannes

> On 11 Apr 2016, at 19:36, Aman Sinha <[email protected]> wrote:
> 
> There is a JIRA related to one aspect of this: DRILL-1950 (filter pushdown
> into parquet scan).  This is still work in progress I believe.  Once that
> is implemented, the scan will produce the filtered rows only.
> 
> Regarding column projections, currently in Drill, the columns referenced
> anywhere in the query (including SELECT list) need to be produced by the
> table scan, so the scan will read all those columns, not just the ones in
> the filter condition.   You can see what columns are being produced by the
> Scan node from the EXPLAIN plan.
> 
> What would help for the SELECT * case is *late materialization of columns*,
> i.e., even if the filter does not get pushed down into scan, we could read
> only the 'id' column from the table first, do the filtering that supposedly
> selects 1 row, then do a late materialization of all other columns just for
> that 1 row by using a row-id based lookup (if the underlying storage format
> supports rowid based lookup).  This would be a feature request. I am not
> sure if a JIRA already exists for it or not.
> 
> -Aman
> 
> On Mon, Apr 11, 2016 at 9:24 AM, Ted Dunning <[email protected]> wrote:
> 
>> I just replicated these results. Full table scans with aggregation take
>> pretty much exactly the same amount of time with or without filtering.
>> 
>> 
>> 
>> On Mon, Apr 11, 2016 at 8:09 AM, Johannes Zillmann <[email protected]> wrote:
>> 
>>> Hey Ted,
>>> 
>>> Sorry, I mixed up row and column!
>>> 
>>> The queries look like this:
>>>        (1) "SELECT * FROM dfs.`myParquetFile` WHERE `id` = 23"
>>>        (2) "SELECT id FROM dfs.`myParquetFile` WHERE `id` = 23"
>>> 
>>> (1) is 14 sec and (2) is 1.5 sec.
>>> Using drill-1.6.
>>> So it looks like Drill is extracting the columns before filtering, which I
>>> didn’t expect…
>>> Is there any way to change that behaviour?
>>> 
>>> Johannes
>>> 
>>> 
>>> 
>>>> On 11 Apr 2016, at 16:42, Ted Dunning <[email protected]> wrote:
>>>> 
>>>> Did you mean that you are doing a select to find a single column? What you
>>>> typed was row, but that seems out of line with the rest of what you wrote.
>>>> 
>>>> If you are truly asking about filtering down to a single row, whether it
>>>> costs more to return all of the columns rather than just one from a single
>>>> row will depend on whether Drill is extracting columns before filtering or
>>>> after.
>>>> 
>>>> 
>>>> 
>>>> On Mon, Apr 11, 2016 at 6:41 AM, Johannes Zillmann <[email protected]> wrote:
>>>> 
>>>>> Hey there,
>>>>> 
>>>>> I am currently doing some performance measurements on Drill.
>>>>> In my case it’s a single Parquet file with a single local Drillbit.
>>>>> 
>>>>> Now in one case I have unexpected results and I’m curious if somebody has
>>>>> a good explanation for it!
>>>>> 
>>>>> So I have a file with 10 million rows and 9 columns.
>>>>> Now I’m doing a select statement to find one single row.
>>>>> Runtime with select * : ~ 14.79 s
>>>>> Runtime with select(filterField) : ~ 1.5 sec
>>>>> 
>>>>> So I’m surprised that there is so much variance depending on the fields I
>>>>> select, since I thought Drill needs most of the time for finding that one
>>>>> element, and then deserializes the other fields only on a hit…
>>>>> But for deserializing 8 more fields, 10 sec seems way too much!?
>>>>> 
>>>>> best
>>>>> Johannes
>>>>> 
>>>>> 
>>> 
>>> 
>>
