Hi nguyen,

Thank you for the quick response. What I am trying to understand is that in
both queries the predicate evaluation requires only one column. So Spark
should not need to read the other columns in the projection for rows that
the filter predicate eliminates. Just to give an example, Amazon Redshift
has this kind of optimization (
https://aws.amazon.com/about-aws/whats-new/2017/12/amazon-redshift-introduces-late-materialization-for-faster-query-processing/
)
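
For reference, this is how I am checking what the scan actually does (a
spark-shell sketch against my table; the plan output below is abbreviated
from memory):

  // Inspect the physical plan: the FileScan parquet node shows which
  // columns are read (ReadSchema) and which filters are pushed down.
  val df = spark.sql("select * from mytable where businesskey='key1'")
  df.explain()
  // FileScan parquet ... PushedFilters: [IsNotNull(businesskey),
  //   EqualTo(businesskey,key1)], ReadSchema: struct<businesskey:string,
  //   transactionname:string,request:string,response:string>
  // The pushed filter can only skip whole row groups via min/max stats;
  // inside a matching row group every projected column is still read.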

Thanks.


On Mar 18, 2018 8:09 PM, "nguyen duc Tuan" <newvalu...@gmail.com> wrote:

> Hi @CPC,
> Parquet is a columnar storage format, so if you want to read data from
> only one column, you can do that without accessing all of your data.
> Spark SQL includes a query optimizer (see
> https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
> ), so it will optimize your query and produce an optimized plan to
> execute it. Since your second query only needs data from 2 columns
> (businesskey and transactionname), it reads less data, as you observed.
> Hope this helps.
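>
> A quick illustration (just a sketch for spark-shell; the path is
> hypothetical) of the column pruning the optimizer performs:
>
>   // Only the filtered and projected columns appear in the scan's
>   // ReadSchema; the huge request/response columns are pruned away.
>   val df = spark.read.parquet("/path/to/mytable")  // hypothetical path
>     .filter($"businesskey" === "key1")   // $ syntax is predefined in spark-shell
>     .select("transactionname")
>   df.explain()
>   // FileScan parquet ... ReadSchema: struct<businesskey:string,transactionname:string>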
>
> 2018-03-19 0:02 GMT+07:00 CPC <acha...@gmail.com>:
>
>> Hi everybody,
>>
>> I am trying to understand how Spark reads Parquet files, but I am a
>> little confused. I have a table with 4 columns named businesskey,
>> transactionname, request and response. The request and response columns
>> are huge (10-50 KB per value). When I execute a query like
>> "select * from mytable where businesskey='key1'"
>> it reads the whole table (2.4 TB) even though it returns 1 row. If I
>> execute
>> "select transactionname from mytable where businesskey='key1'"
>> it reads 390 GB. I expected the two queries to read the same amount of
>> data, since both filter on businesskey. In some databases this is called
>> late materialization (do not read the whole row if the predicate
>> eliminates it). Why is the first query reading the whole table? Do you
>> have any idea? The Spark version is 2.2 on Cloudera 5.12.
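>>
>> To reproduce (a sketch; mytable is the 4-column Parquet table described
>> above):
>>
>>   // Compare the ReadSchema in the two physical plans: "select *" makes
>>   // the scan read all 4 columns (~2.4 TB here), while projecting one
>>   // column prunes the scan down to 2 columns (~390 GB here).
>>   spark.sql("select * from mytable where businesskey='key1'").explain()
>>   spark.sql("select transactionname from mytable where businesskey='key1'").explain()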
>>
>> Thanks in advance...
>>
>
>
