Hi,
I use HCatalog Streaming Mutation API to write data to hive transactional
table, and then, I use SparkSQL to read data from the hive transactional
table. I get the right result.
However, SparkSQL uses more time to read hive orc bucket transactional
table, beacause SparkSQL read all columns(not The columns involved in SQL) so
it uses more time.
My question is why that SparkSQL read all columns of hive orc bucket
transactional table, but not the columns involved in SQL? Is it possible to
control the SparkSQL read the columns involved in SQL?
For example:
Hive Table:
create table dbtest.t_a1 (t0 VARCHAR(36),t1 string,t2 double,t5 int ,t6 int)
partitioned by(sd string,st string) clustered by(t0) into 10 buckets stored
as orc TBLPROPERTIES ('transactional'='true');
create table dbtest.t_a2 (t0 VARCHAR(36),t1 string,t2 double,t5 int ,t6 int)
partitioned by(sd string,st string) clustered by(t0) into 10 buckets stored
as orc TBLPROPERTIES ('transactional'='false');
SparkSQL:
select sum(t1),sum(t2) from dbtest.t_a1 group by t0;
select sum(t1),sum(t2) from dbtest.t_a2 group by t0;
SparkSQL's stage Input size:
dbtest.t_a1=113.9 GB,
dbtest.t_a2=96.5 MB
Best regards.