Hi, I am using Spark 1.1.0 and running all of the queries below through the spark-sql shell.
I created an external Parquet table with the following SQL:

    create external table daily (<15 column names>)
    ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
    STORED AS INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
    OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'
    LOCATION '<parquet file location>';

Then I cached the table with the following set of commands:

    set spark.sql.inMemoryColumnarStorage.compressed=true;
    cache table daily;
    select count(*) from daily;

The in-memory size of this table after caching is ~40 GB, and the complete table fits in the cache. However, when I run a simple query that involves only one of the table's 15 columns, the Spark web UI shows the whole table (~40 GB) being read from the cache instead of just that one column. A sample query I ran after caching the table:

    select count(distinct col1) from daily;

Since the cached data is stored in columnar format, I expect that only the required column should be read from the cache. Can someone please tell me whether my expectation is correct? And if so, what am I missing? Is there a configuration setting or something else that would give me the desired result?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Sql-reading-whole-table-from-cache-instead-of-required-coulmns-tp21113.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
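For reference, here is a sketch of how the same workflow could be driven programmatically from a Spark 1.1 application via HiveContext (needed for Hive DDL such as CREATE EXTERNAL TABLE). This is an illustrative sketch only: it assumes a running Spark 1.1 deployment built with Hive support, and the object name `CacheDaily` is made up; the table and column names are the placeholders from the post.

```scala
// Hypothetical sketch: same caching workflow as the spark-sql session above,
// expressed against the Spark 1.1 HiveContext API. Requires a live Spark
// deployment with Hive support; not runnable standalone.
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

object CacheDaily {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext()
    val hive = new HiveContext(sc)

    // Enable compression for the in-memory columnar store.
    hive.sql("set spark.sql.inMemoryColumnarStorage.compressed=true")

    // Mark the table for in-memory columnar caching, then materialize
    // the cache with a full scan.
    hive.cacheTable("daily")
    hive.sql("select count(*) from daily").collect()

    // Ideally this single-column query would scan only col1 from the cache.
    hive.sql("select count(distinct col1) from daily").collect()
  }
}
```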