Hi, I am using Spark 1.1.0 and running all of the queries below through the spark-sql shell.
I created an external Parquet table with the following SQL:

    create external table daily (<15 column names>)
    ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
    STORED AS INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
    OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'
    LOCATION '<parquet file location>';

Then I cached the table with the following set of commands:

    set spark.sql.inMemoryColumnarStorage.compressed=true;
    cache table daily;
    select count(*) from daily;

The in-memory size of this table after caching is ~40 GB, and the complete table fits in the cache. However, when I run a simple query that involves only one of the table's 15 columns, the Spark web UI shows the whole table (~40 GB) being read from the cache instead of just that one column. A sample query I ran after caching the table:

    select count(distinct col1) from daily;

Since the cached data is stored in columnar format, I expect that only the required column should be read from the cache. Can someone please tell me whether my expectation is correct? And if so, what am I missing? Is there a configuration setting or something else that would give me the desired result?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Sql-reading-whole-table-from-cache-instead-of-required-coulmns-tp21113.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
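For reference, here is a sketch of how the same workflow could be driven programmatically from a Spark 1.1 application via HiveContext (needed for Hive DDL such as CREATE EXTERNAL TABLE). This is an illustrative sketch only: it assumes a running Spark 1.1 deployment built with Hive support, and the object name `CacheDaily` is made up; the table and column names are the placeholders from the post.

```scala
// Hypothetical sketch: same caching workflow as the spark-sql session above,
// expressed against the Spark 1.1 HiveContext API. Requires a live Spark
// deployment with Hive support; not runnable standalone.
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

object CacheDaily {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext()
    val hive = new HiveContext(sc)

    // Enable compression for the in-memory columnar store.
    hive.sql("set spark.sql.inMemoryColumnarStorage.compressed=true")

    // Mark the table for in-memory columnar caching, then materialize
    // the cache with a full scan.
    hive.cacheTable("daily")
    hive.sql("select count(*) from daily").collect()

    // Ideally this single-column query would scan only col1 from the cache.
    hive.sql("select count(distinct col1) from daily").collect()
  }
}
```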