Thanks, Cheng, for replying. I meant changing the number of partitions of a cached table; it doesn't need to be re-adjusted after caching.
To provide more context: what I am seeing on my dataset is a large number of tasks. Since each task appears to be mapped to a partition, I want to see whether matching the partition count to the available core count will make it faster. I'll give your suggestion a try to see if it helps. Experimenting is a great way to learn more about Spark internals.

From: Cheng Lian [mailto:lian.cs....@gmail.com]
Sent: Monday, March 16, 2015 5:41 AM
To: Judy Nash; user@spark.apache.org
Subject: Re: configure number of cached partition in memory on SparkSQL

Hi Judy,

In the case of HadoopRDD and NewHadoopRDD, the partition number is actually decided by the InputFormat used. And spark.sql.inMemoryColumnarStorage.batchSize is not related to the partition number; it controls the in-memory columnar batch size within a single partition.

Also, what do you mean by "change the number of partitions after caching the table"? Are you trying to re-cache an already cached table with a different partition number? Currently I don't see a super intuitive pure SQL way to set the partition number in this case. Maybe you can try this (assuming table t has a column s which is expected to be sorted):

  SET spark.sql.shuffle.partitions = 10;
  CACHE TABLE cached_t AS SELECT * FROM t ORDER BY s;

In this way, we introduce a shuffle by sorting on a column, and grow or shrink the partition number at the same time. This might not be the best way out there, but it's the first one that jumped into my head.

Cheng

On 3/5/15 3:51 AM, Judy Nash wrote:

Hi,

I am tuning a Hive dataset on Spark SQL deployed via the Thrift server. How can I change the number of partitions created by caching the table on the Thrift server?

I have tried the following but still get the same number of partitions after caching:

  spark.default.parallelism
  spark.sql.inMemoryColumnarStorage.batchSize

Thanks,
Judy
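Applying Cheng's recipe to the goal stated above (matching the partition count to the available cores) might look like the following sketch. The core count of 16 and the names cached_my_table, my_table, and sort_col are placeholders, not from the thread; adjust them to your deployment.

  -- Assumption: 16 cores available across the cluster (placeholder value).
  SET spark.sql.shuffle.partitions = 16;

  -- Drop the old cached copy, then re-cache through a shuffle (the ORDER BY
  -- forces one), so the cached data picks up the new partition count.
  UNCACHE TABLE cached_my_table;
  CACHE TABLE cached_my_table AS SELECT * FROM my_table ORDER BY sort_col;

Since spark.sql.shuffle.partitions applies to every subsequent shuffle in the session, it may be worth resetting it afterwards if other queries run on the same Thrift server connection.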