Hi @CPC, Parquet is a columnar storage format, so if you only need data from one column, Spark can read that column without touching the rest of your data. Spark SQL also includes a query optimizer, Catalyst (see https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html), which optimizes your query and builds an optimized execution plan. Since your second query only needs data from two columns (businesskey and transactionname), it reads much less data, as you observed. Your first query uses "select *", so its projection includes all four columns, and Spark has to read the huge request and response columns as well. Hope it helps.
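The effect of column pruning can be illustrated with a toy sketch (plain Python, not Parquet itself; the table and sizes below are made up to mirror the schema in the question): with a columnar layout, the bytes you scan depend only on the columns your query projects.

```python
# Toy columnar "table" mirroring the thread's schema.
# Each column is stored separately, as in a columnar format.
table = {
    "businesskey":     ["key1", "key2", "key3"],
    "transactionname": ["tx1", "tx2", "tx3"],
    "request":         ["r" * 10_000] * 3,   # large payload column
    "response":        ["s" * 10_000] * 3,   # large payload column
}

def bytes_scanned(columns):
    """Bytes touched if we read only the listed columns."""
    return sum(len(value) for col in columns for value in table[col])

# "select * ..." must touch every column, including the huge ones.
full = bytes_scanned(table.keys())

# "select transactionname ... where businesskey=..." only needs two columns.
pruned = bytes_scanned(["businesskey", "transactionname"])

print(full, pruned)  # pruned is a tiny fraction of full
```

This is why the two queries in the thread scan such different amounts of data, even though both return the same single row.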
2018-03-19 0:02 GMT+07:00 CPC <acha...@gmail.com>:
> Hi everybody,
>
> I am trying to understand how Spark reads Parquet files, but I am a little
> confused. I have a table with four columns, named businesskey,
> transactionname, request and response. The request and response columns
> are huge (10-50 KB). When I execute a query like
> "select * from mytable where businesskey='key1'"
> it reads the whole table (2.4 TB) even though it returns 1 row. If I execute
> "select transactionname from mytable where businesskey='key1'"
> it reads 390 GB. I expected the two queries to read the same amount of data,
> since both filter on businesskey. In some databases this is called late
> materialization (don't read the whole row if the predicate eliminates it).
> Why is the first query reading all the data? Do you have any idea? Spark
> version is 2.2 on Cloudera 5.12.
>
> Thanks in advance...
>