You can use the EXPLAIN statement to see the optimized plan for each query ( https://stackoverflow.com/questions/35883620/spark-how-can-get-the-logical-physical-query-execution-using-thirft-hive ).
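For example, a minimal sketch in spark-shell (assuming the table name mytable and columns from the thread below; the second form also prints the parsed, analyzed and optimized logical plans):

    // Print only the physical plan for the query
    spark.sql("SELECT transactionname FROM mytable WHERE businesskey = 'key1'").explain()

    // Print the parsed, analyzed and optimized logical plans plus the physical plan
    spark.sql("EXPLAIN EXTENDED SELECT transactionname FROM mytable WHERE businesskey = 'key1'").show(truncate = false)

In the physical plan, the ReadSchema and PushedFilters entries of the Parquet scan show which columns are actually read and which predicates are pushed down.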
2018-03-19 0:52 GMT+07:00 CPC <acha...@gmail.com>:

> Hi nguyen,
>
> Thank you for the quick response. But what I am trying to understand is that in
> both queries the predicate evaluation requires only one column. So Spark does
> not actually need to read all of the projected columns if they are not used in
> the filter predicate. Just to give an example, Amazon Redshift has this kind of
> optimization (https://aws.amazon.com/about-aws/whats-new/2017/12/amazon-redshift-introduces-late-materialization-for-faster-query-processing/).
>
> Thanks..
>
>
> On Mar 18, 2018 8:09 PM, "nguyen duc Tuan" <newvalu...@gmail.com> wrote:
>
>> Hi @CPC,
>> Parquet is a columnar storage format, so if you want to read data from only
>> one column, you can do that without accessing all of your data. Spark SQL
>> includes a query optimizer (see https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html),
>> so it will optimize your query and create an optimized plan to execute it.
>> Since your second query only needs data from 2 columns (businesskey and
>> transactionname), it reads less data, as you observed.
>> Hope it helps you.
>>
>> 2018-03-19 0:02 GMT+07:00 CPC <acha...@gmail.com>:
>>
>>> Hi everybody,
>>>
>>> I am trying to understand how Spark reads Parquet files, but I am a little
>>> confused. I have a table with 4 columns named businesskey, transactionname,
>>> request and response. The request and response columns are huge (10-50 KB
>>> each). When I execute a query like
>>> "select * from mytable where businesskey='key1'"
>>> it reads the whole table (2.4 TB) even though it returns 1 row. If I execute
>>> "select transactionname from mytable where businesskey='key1'"
>>> it reads 390 GB. I expected the two queries to read the same amount of data,
>>> since both filter on businesskey. In some databases this is called late
>>> materialization (don't read the whole row if the predicate eliminates it).
>>> Why is the first query reading the whole table? Do you have any idea? The
>>> Spark version is 2.2 on Cloudera 5.12.
>>>
>>> Thanks in advance...
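For reference, a minimal sketch of the behavior described in the quoted question (the path /data/mytable is hypothetical; spark.sql.parquet.filterPushdown is on by default in Spark 2.2):

    // Hypothetical path; mytable and its columns are as described in the question
    val df = spark.read.parquet("/data/mytable")
    df.createOrReplaceTempView("mytable")

    // Projection is *, so Spark materializes all four columns (including the
    // huge request/response columns) for every row group it scans, even though
    // the businesskey filter is pushed down to Parquet.
    spark.sql("SELECT * FROM mytable WHERE businesskey = 'key1'").show()

    // Projection references only two columns, so Catalyst prunes the scan to
    // the businesskey and transactionname column chunks and far less is read.
    spark.sql("SELECT transactionname FROM mytable WHERE businesskey = 'key1'").show()

In other words, column pruning follows the projection, not the filter, so narrowing the SELECT list explicitly is the practical way to cut the bytes read here.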