Hi,

I think that the speed of ORC reads has been improved in the latest versions. Any chance you could use the latest version?
Regards,
Gourav Sengupta

On 17 Oct 2018 6:11 am, "daily" <asos...@foxmail.com> wrote:

Hi,

Spark version: 2.3.0
Hive version: 2.1.0

Best regards.

------------------ Original Message ------------------
*From:* "Gourav Sengupta" <gourav.sengu...@gmail.com>
*Sent:* Tuesday, 16 October 2018, 6:35 PM
*To:* "daily" <asos...@foxmail.com>
*Cc:* "user" <user@spark.apache.org>; "dev" <d...@spark.apache.org>
*Subject:* Re: SparkSQL read Hive transactional table

Hi,

Can I please ask which versions of Hive and Spark you are using?

Regards,
Gourav Sengupta

On Tue, Oct 16, 2018 at 2:42 AM daily <asos...@foxmail.com> wrote:
> Hi,
>
> I use the HCatalog Streaming Mutation API to write data to a Hive
> transactional table, and then I use SparkSQL to read data from that
> table. I get the correct result.
> However, SparkSQL takes much longer to read the Hive ORC bucketed
> transactional table, because it reads all columns (not just the columns
> involved in the SQL), and so it uses more time.
> My question is: why does SparkSQL read all columns of a Hive ORC
> bucketed transactional table instead of only the columns involved in
> the SQL? Is it possible to make SparkSQL read only the columns
> involved in the SQL?
>
> For example:
>
> Hive tables:
> create table dbtest.t_a1 (t0 VARCHAR(36), t1 string, t2 double, t5 int, t6 int)
>   partitioned by (sd string, st string) clustered by (t0) into 10 buckets
>   stored as orc TBLPROPERTIES ('transactional'='true');
>
> create table dbtest.t_a2 (t0 VARCHAR(36), t1 string, t2 double, t5 int, t6 int)
>   partitioned by (sd string, st string) clustered by (t0) into 10 buckets
>   stored as orc TBLPROPERTIES ('transactional'='false');
>
> SparkSQL:
> select sum(t1), sum(t2) from dbtest.t_a1 group by t0;
> select sum(t1), sum(t2) from dbtest.t_a2 group by t0;
>
> SparkSQL stage input size:
>
> dbtest.t_a1 = 113.9 GB
> dbtest.t_a2 = 96.5 MB
>
> Best regards.
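[Editor's note, hedged:] One plausible explanation for the size difference above is that Spark 2.3 has no native support for Hive ACID table layouts (base/delta directories), so transactional ORC tables are usually read through the Hive SerDe scan path, which does not push column pruning down to the reader, while the non-transactional table can go through Spark's native ORC reader with pruning. A way to investigate is to compare the physical plans and try the native-reader settings below; the configuration keys are real Spark 2.3 settings, but whether they help for a transactional table is an assumption to verify, not a guarantee:

```sql
-- Compare the physical plans. A HiveTableScan node suggests the Hive SerDe
-- path (no column pruning pushed to the ORC reader); a FileScan orc node with
-- a ReadSchema listing only the selected columns indicates the native reader.
EXPLAIN SELECT sum(t2) FROM dbtest.t_a1 GROUP BY t0;
EXPLAIN SELECT sum(t2) FROM dbtest.t_a2 GROUP BY t0;

-- Settings that influence which ORC reader Spark 2.3 uses. These are known
-- to apply to non-transactional ORC tables; for ACID tables Spark 2.3 may
-- still fall back to the Hive SerDe path regardless.
SET spark.sql.hive.convertMetastoreOrc=true;
SET spark.sql.orc.impl=native;
SET spark.sql.orc.filterPushdown=true;
```

If the plan for `t_a1` still shows a HiveTableScan after these settings, the full-column read is coming from the Hive ACID reader itself, and upgrading (as suggested above) or compacting the table into a non-transactional copy may be the only workarounds.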