Thanks Michael.

2015-01-08 6:04 GMT+08:00 Michael Armbrust <mich...@databricks.com>:
> The cache command caches the entire table, with each column stored in its
> own byte buffer. When querying the data, only the columns that you are
> asking for are scanned in memory. I'm not sure what mechanism Spark is
> using to report the amount of data read.
>
> If you want to read only the data that you are looking for off of the
> disk, I'd suggest looking at Parquet.
>
> On Wed, Jan 7, 2015 at 1:37 AM, Xuelin Cao <xuelin...@yahoo.com.invalid>
> wrote:
>
>> Hi,
>>
>> Curiouser and curiouser. I'm puzzled by the Spark SQL cached table.
>>
>> Theoretically, the cached table should be a columnar table, and only
>> the columns referenced in my SQL should be scanned.
>>
>> However, in my test, I always see the whole table being scanned even
>> though I only "select" one column in my SQL.
>>
>> Here is my code:
>>
>> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>> import sqlContext._
>>
>> sqlContext.jsonFile("/data/ad.json").registerTempTable("adTable")
>> sqlContext.cacheTable("adTable")   // The table has > 10 columns
>>
>> // First run, cache the table into memory
>> sqlContext.sql("select * from adTable").collect
>>
>> // Second run, only one column is used. It should only scan a small
>> // fraction of the data
>> sqlContext.sql("select adId from adTable").collect
>>
>> sqlContext.sql("select adId from adTable").collect
>> sqlContext.sql("select adId from adTable").collect
>>
>> What I found is, every time I run the SQL, the Web UI shows that the
>> total amount of input data is always the same: the full size of the table.
>>
>> Is anything wrong? My expectation is:
>> 1. The cached table is stored as a columnar table
>> 2. Since I only need one column in my SQL, the total amount of
>> input data shown in the Web UI should be very small
>>
>> But what I found is totally not the case. Why?
>>
>> Thanks
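For reference, here is a minimal sketch of the Parquet route Michael suggests, assuming the Spark 1.2-era SchemaRDD API and a hypothetical output path /data/ad.parquet:

// A minimal sketch, not the exact setup discussed above.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Convert the JSON data to Parquet once; Parquet stores the data
// column by column on disk, so later scans can skip unused columns.
sqlContext.jsonFile("/data/ad.json").saveAsParquetFile("/data/ad.parquet")

// Register the Parquet file as a table and query a single column;
// only the adId column chunks should be read from disk.
sqlContext.parquetFile("/data/ad.parquet").registerTempTable("adParquet")
sqlContext.sql("select adId from adParquet").collect()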