Hi, Cheng

      In your code:

cacheTable("tbl")
sql("select * from tbl").collect() sql("select name from tbl").collect()

     When the first SQL runs, the table is not cached yet, so its input
data will be the original JSON file.
     After the table is cached, the JSON data is no longer scanned; later
queries read the in-memory columnar data instead, so the total amount of
input data also drops.
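
     For example, here is a rough sketch against the Spark 1.2 shell API
(reusing the table name "tbl" from the snippet above; the exact numbers
will differ) that materializes the cache once before measuring:

// Assumes sqlContext, "import sqlContext._", and the temp table "tbl" as above.
cacheTable("tbl")           // lazy: nothing is materialized yet
table("tbl").count()        // one full scan, which builds the in-memory columnar buffers
println(isCached("tbl"))    // true once the table is marked as cached

// From here on, a single-column query should report a much smaller
// "Input" size in the web UI than the original JSON file.
sql("select name from tbl").collect()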

     If you try it like this:

cacheTable("tbl")
sql("select * from tbl").collect() sql("select name from tbl").collect()
sql("select * from tbl").collect()

     Is the input data of the 3rd SQL bigger than 49.1KB?
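
     One more thing worth checking is the physical plan of the pruned query
(again just a sketch against the Spark 1.2 shell; the exact output format
may differ):

// After the table is cached and materialized, a pruned query should be an
// in-memory columnar scan over just the requested column.
val pruned = sql("select name from tbl")
println(pruned.queryExecution.executedPlan)
// Expected to mention something like:
//   InMemoryColumnarTableScan [name], (InMemoryRelation ...)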




On Thu, Jan 8, 2015 at 9:36 PM, Cheng Lian <lian.cs....@gmail.com> wrote:

>  Weird, which version did you use? I just tried a small snippet in the
> Spark 1.2.0 shell as follows, and the result shown in the web UI matches
> the expectation quite well:
>
> import org.apache.spark.sql.SQLContext
> import sc._
> val sqlContext = new SQLContext(sc)
> import sqlContext._
>
> jsonFile("file:///tmp/p.json").registerTempTable("tbl")
> cacheTable("tbl")
> sql("select * from tbl").collect()
> sql("select name from tbl").collect()
>
> The input data of the first statement is 292KB; the second is 49.1KB.
>
> The JSON file I used is examples/src/main/resources/people.json; I copied
> its contents multiple times to generate a larger file.
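>
> Roughly like this, in case it helps reproduce (a throwaway sketch; the
> repetition count and output path are arbitrary):
>
> // Replicate people.json a few hundred times into /tmp/p.json.
> val lines = scala.io.Source.fromFile("examples/src/main/resources/people.json").getLines().toList
> val out = new java.io.PrintWriter("/tmp/p.json")
> (1 to 500).foreach(_ => lines.foreach(l => out.println(l)))
> out.close()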
>
> Cheng
>
> On 1/8/15 7:43 PM, Xuelin Cao wrote:
>
>
>
>  Hi, Cheng
>
>       I checked the "Input" data for each stage. For example, in my
> attached screen snapshot, the input data is 1212.5MB, which is the total
> size of the whole table.
>
>  [image: screenshot of the stage page, showing 1212.5MB of input data]
>
>       I also checked the input data for each task (on the stage detail
> page), and the sum of the input data across tasks is also 1212.5MB.
>
>
>
>
> On Thu, Jan 8, 2015 at 6:40 PM, Cheng Lian <lian.cs....@gmail.com> wrote:
>
>>  Hey Xuelin, which data item in the Web UI did you check?
>>
>>
>> On 1/7/15 5:37 PM, Xuelin Cao wrote:
>>
>>
>>  Hi,
>>
>>        Curiouser and curiouser. I'm puzzled by Spark SQL cached tables.
>>
>>        In theory, a cached table should be stored in a columnar format,
>> and a query should only scan the columns referenced in its SQL.
>>
>>        However, in my test, I always see the whole table being scanned,
>> even though I only "select" one column in my SQL.
>>
>>        Here is my code:
>>
>>
>> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>> import sqlContext._
>>
>> sqlContext.jsonFile("/data/ad.json").registerTempTable("adTable")
>> sqlContext.cacheTable("adTable")  // The table has > 10 columns
>>
>> // First run, cache the table into memory
>> sqlContext.sql("select * from adTable").collect
>>
>> // Second run, only one column is used. It should only scan a small
>> // fraction of the data
>> sqlContext.sql("select adId from adTable").collect
>> sqlContext.sql("select adId from adTable").collect
>> sqlContext.sql("select adId from adTable").collect
>>
>>          What I found is that every time I run the SQL, the web UI shows
>> the same total amount of input data: the full size of the table.
>>
>>          Is anything wrong? My expectation is:
>>         1. The cached table is stored as a columnar table
>>         2. Since I only need one column in my SQL, the total amount of
>> input data shown in the web UI should be very small
>>
>>          But what I found is totally not the case. Why?
>>
>>          Thanks
>>
>>
>>
>
