Hi @CPC, Parquet is a columnar storage format, so if you only need data from one column, Spark can read that column without touching the rest of your data. Spark SQL also includes a query optimizer, Catalyst (see https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html), which optimizes your query and builds an optimized execution plan. Since your second query only needs data from two columns (businesskey and transactionname), it reads much less data, as you observed. Your first query uses "select *", so its projection includes all four columns, and Spark has to read the huge request and response columns as well. Hope it helps.
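The effect of column pruning can be illustrated with a toy sketch (plain Python, not Parquet itself; the table and sizes below are made up to mirror the schema in the question): with a columnar layout, the bytes you scan depend only on the columns your query projects.

```python
# Toy columnar "table" mirroring the thread's schema.
# Each column is stored separately, as in a columnar format.
table = {
    "businesskey":     ["key1", "key2", "key3"],
    "transactionname": ["tx1", "tx2", "tx3"],
    "request":         ["r" * 10_000] * 3,   # large payload column
    "response":        ["s" * 10_000] * 3,   # large payload column
}

def bytes_scanned(columns):
    """Bytes touched if we read only the listed columns."""
    return sum(len(value) for col in columns for value in table[col])

# "select * ..." must touch every column, including the huge ones.
full = bytes_scanned(table.keys())

# "select transactionname ... where businesskey=..." only needs two columns.
pruned = bytes_scanned(["businesskey", "transactionname"])

print(full, pruned)  # pruned is a tiny fraction of full
```

This is why the two queries in the thread scan such different amounts of data, even though both return the same single row.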
2018-03-19 0:02 GMT+07:00 CPC <acha...@gmail.com>:
> Hi everybody,
>
> I am trying to understand how Spark reads Parquet files, but I am a little
> confused. I have a table with four columns, named businesskey,
> transactionname, request and response. The request and response columns
> are huge (10-50 KB). When I execute a query like
> "select * from mytable where businesskey='key1'"
> it reads the whole table (2.4 TB) even though it returns 1 row. If I execute
> "select transactionname from mytable where businesskey='key1'"
> it reads 390 GB. I expected the two queries to read the same amount of data,
> since both filter on businesskey. In some databases this is called late
> materialization (don't read the whole row if the predicate eliminates it).
> Why is the first query reading all the data? Do you have any idea? Spark
> version is 2.2 on Cloudera 5.12.
>
> Thanks in advance...
>