Thanks Michael.

2015-01-08 6:04 GMT+08:00 Michael Armbrust <mich...@databricks.com>:
> The cache command caches the entire table, with each column stored in its
> own byte buffer. When querying the data, only the columns that you are
> asking for are scanned in memory. I'm not sure what mechanism Spark is
> using to report the amount of data read.
>
> If you want to read only the data that you are looking for off of the
> disk, I'd suggest looking at Parquet.
>
> On Wed, Jan 7, 2015 at 1:37 AM, Xuelin Cao <xuelin...@yahoo.com.invalid>
> wrote:
>
>> Hi,
>>
>> Curiouser and curiouser. I'm puzzled by the Spark SQL cached table.
>>
>> Theoretically, the cached table should be a columnar table, and only
>> the columns referenced in my SQL should be scanned.
>>
>> However, in my test, I always see the whole table being scanned even
>> though I only "select" one column in my SQL.
>>
>> Here is my code:
>>
>> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>> import sqlContext._
>>
>> sqlContext.jsonFile("/data/ad.json").registerTempTable("adTable")
>> sqlContext.cacheTable("adTable")   // The table has > 10 columns
>>
>> // First run, cache the table into memory
>> sqlContext.sql("select * from adTable").collect
>>
>> // Second run, only one column is used. It should only scan a small
>> // fraction of the data
>> sqlContext.sql("select adId from adTable").collect
>>
>> sqlContext.sql("select adId from adTable").collect
>> sqlContext.sql("select adId from adTable").collect
>>
>> What I found is, every time I run the SQL, the Web UI shows that the
>> total amount of input data is always the same: the full size of the table.
>>
>> Is anything wrong? My expectation is:
>> 1. The cached table is stored as a columnar table
>> 2. Since I only need one column in my SQL, the total amount of
>> input data shown in the Web UI should be very small
>>
>> But what I found is totally not the case. Why?
>>
>> Thanks
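For reference, here is a minimal sketch of the Parquet route Michael suggests, assuming the Spark 1.2-era SchemaRDD API and a hypothetical output path /data/ad.parquet:

// A minimal sketch, not the exact setup discussed above.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Convert the JSON data to Parquet once; Parquet stores the data
// column by column on disk, so later scans can skip unused columns.
sqlContext.jsonFile("/data/ad.json").saveAsParquetFile("/data/ad.parquet")

// Register the Parquet file as a table and query a single column;
// only the adId column chunks should be read from disk.
sqlContext.parquetFile("/data/ad.parquet").registerTempTable("adParquet")
sqlContext.sql("select adId from adParquet").collect()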